 |
| |
In This Issue |
704 |
In this issue Bert Boyce |
| |
Research Article |
706
|
Graph Structure in Three National Academic Webs: Power Laws with Anomalies
Mike Thelwall and David Wilkinson Published online 16 April 2003
Thelwall and Wilkinson use crawls of university web sites in the UK, Australia, and New Zealand
to generate all links targeted at same-country university web sites, which they then use to create a graph structure for study. Using Broder's study as a model, they identify a strongly connected component (SCC), where one
could start anywhere in the set and reach every other page, and an Out component, whose pages can be reached from all strongly connected pages but provide no link back to that set. The other components in the Broder
model are not accessible except with access to a major search engine database. In-link and out-link counts for all three university systems, in both the Out and SCC components, display when graphed logarithmically the
linear pattern that would indicate that power laws, and a success-breeds-success phenomenon, are generally in effect. However, automatically generated pages, non-HTML web pages, and large resource-driven sites were all
associated with anomalies in this observation.
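The SCC and Out definitions in the summary above can be illustrated with a minimal sketch. The function names, the toy link list, and the choice of a seed page are assumptions made for illustration; they are not taken from the article.

```python
from collections import defaultdict


def reachable(start, adj):
    """Return every node reachable from `start` by following edges in `adj`."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def scc_and_out(links, seed):
    """Split pages into the SCC containing `seed` and its Out component.

    `links` is an iterable of (source_page, target_page) pairs.
    """
    forward = defaultdict(set)
    backward = defaultdict(set)
    for src, dst in links:
        forward[src].add(dst)
        backward[dst].add(src)
    descendants = reachable(seed, forward)   # pages the seed can reach
    ancestors = reachable(seed, backward)    # pages that can reach the seed
    scc = descendants & ancestors            # mutually reachable pages
    out = descendants - scc                  # reachable, but no path back
    return scc, out


# Hypothetical toy web: a, b, c link in a cycle; d and e hang off the cycle.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e"), ("f", "a")]
scc, out = scc_and_out(links, "a")
```

Here page f links into the cycle but cannot be reached from it, so it falls in neither set, which mirrors the summary's point that the remaining Broder components cannot be found from a crawl alone.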
|
713
|
Efficient Single-Pass Index Construction for Text Databases
Steffen Heinz and Justin Zobel Published online 16 April 2003
Zobel and Heinz review file inversion processes for the creation of text indices and suggest an
efficient single-pass approach. Complete in-memory indexing remains impractical for very large files, and current rapid algorithms require that the entire vocabulary of the collection be kept in memory. The proposed approach
creates inverted files in memory for sequences of documents until memory resources are exhausted, then transfers the lexicon and inverted file in lexicographical order to disk for subsequent merger. Each term is
assigned a dynamic in-memory bit vector that accumulates postings in a compressed d-gap format. The lexicon is maintained in a burst trie structure where leaves are containers of strings with common prefixes.
Performance on five-gigabyte to twenty-gigabyte files is fifteen to twenty percent faster than a sort-based approach.
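The idea of storing postings as compressed d-gaps can be sketched as follows. The summary does not say which code is used to compress the gaps; the variable-byte scheme below is an assumption chosen for brevity, and all function names are hypothetical.

```python
def to_dgaps(doc_ids):
    """Turn an ascending list of document numbers into gaps between them."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_dgaps(gaps):
    """Rebuild the document numbers by summing the gaps."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Variable-byte code (an assumed scheme): 7 data bits per byte,
    low-order bits first, high bit set on the final byte of each number."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers = []
    n = 0
    shift = 0
    for byte in data:
        if byte & 0x80:                      # final byte of this number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


postings = [3, 7, 8, 300, 1000]              # hypothetical postings for one term
encoded = vbyte_encode(to_dgaps(postings))
decoded = from_dgaps(vbyte_decode(encoded))
```

Because the gaps are small numbers, most of them fit in a single byte, which is why gap encoding compresses well before any further coding is applied.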
|
730 |
Automatic Construction of English/Chinese Parallel Corpora
Christopher C. Yang and Kar Wing Li Published online 16 April 2003
Yang and Li describe the automatic matching of English and Chinese document titles by
character and word matching, based upon the study of web pages within a site where some pages exist separately in each language. Word and character alignment is followed by redundancy resolution, and then title alignment
takes place. English words are translated into Chinese character string words by dictionary lookup, and the various possibilities are matched with the Chinese titles using the longest common sequence of characters. Using
Hong Kong Special Administrative Region government press releases and releases from the Hong Kong and Shanghai Banking Corporation, they find 31,567 releases in Chinese and 30,810 in English, but only 23,701
released in parallel. There are no links between the versions. With Recall as the number of correct system matches over the actual matches in the file, and Precision as the number of correct system matches over the number
of system matches, a test yields Precision in the range of 0.998 to 1.00 and Recall from 0.806 to 0.948. Thus links to parallel documents in the other language could quite likely be generated automatically.
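A minimal sketch of the two measurements mentioned above, under stated assumptions: the "longest common sequence of characters" is read here as the standard longest common subsequence, and the Precision and Recall functions follow the definitions given in the summary. The function names and sample data are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def precision_recall(system_pairs, true_pairs):
    """Precision = correct / system matches; Recall = correct / actual matches."""
    system_pairs, true_pairs = set(system_pairs), set(true_pairs)
    correct = len(system_pairs & true_pairs)
    precision = correct / len(system_pairs) if system_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Hypothetical example: a dictionary translation compared with a Chinese title.
shared = lcs_length("香港政府", "香港特區政府")

# Hypothetical (english_id, chinese_id) matches versus the true parallel pairs.
precision, recall = precision_recall(
    {(1, 1), (2, 2), (3, 4)},
    {(1, 1), (2, 2), (3, 3), (4, 4)},
)
```

A high Precision with a lower Recall, as reported in the summary, corresponds to a matcher that rarely proposes a wrong pair but misses some true ones.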
|
743
|
Mining Longitudinal Web Queries: Trends and Patterns
Peiling Wang, Michael W. Berry, and Yiheng Yang Published online 16 April 2003
Wang, Berry, and Yang log hit counts and date-stamped queries to the University of
Tennessee website for a four-year period, as entered through the SWISH search engine as Boolean statements where spaces were considered to be AND operators. Queries were parsed into words and into word pairs of adjacent words
or words separated by one other word (94% of queries contained three words or fewer). URLs were not parsed but treated as unusual queries. Null outputs exceed 30%. Queries averaged 2 words or 13 characters. The number of
queries and the vocabulary used grow over time, but the vocabulary is relatively small and includes a large number (26%) of misspelled words and personal names. Log plots of frequencies and ranks for all words and for
words with unique frequencies overlap in the upper portion, which follows a quadratic polynomial, and diverge in the lower portion, where the all-word line becomes linear. Topics and search behavior vary little over the four-year
period. Websites could be improved by adding content identified from queries.
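The parsing step described above, words plus pairs of words that are adjacent or separated by one other word, might look like the following sketch. Lowercasing and whitespace splitting are assumptions, as is the function name; the summary does not describe the authors' tokenization.

```python
def words_and_pairs(query):
    """Split a query into words and into ordered word pairs whose members
    are adjacent (gap 1) or separated by one other word (gap 2)."""
    words = query.lower().split()
    pairs = []
    for i, word in enumerate(words):
        for gap in (1, 2):
            if i + gap < len(words):
                pairs.append((word, words[i + gap]))
    return words, pairs


# Hypothetical query; spaces act as AND operators in the logged system.
words, pairs = words_and_pairs("Graduate school application")
```

For a three-word query this yields every possible ordered pair, which fits the observation that 94% of queries had three words or fewer.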
|
759
|
Students' Conceptual Structure, Search Process, and Outcome While Preparing a Research Proposal: A Longitudinal Case Study
Mikko Pennanen and Pertti Vakkari Published online 16 April 2003
Pennanen and Vakkari use 22 undergraduate psychology
students doing Boolean searches on PsycINFO, a system with which they were unfamiliar, to investigate the relationship between their conceptual structure of their topic and their search process, and whether these
relations vary depending upon their stage in the Kuhlthau model. Students searched both at the beginning and at the end of their construction of a proposal, and each search was preceded and followed by an interview. The
thought process during search was vocalized and recorded, and transaction logs were also retained. They recorded the number of concepts used by a student, the proportion of sub-concepts included, and the proportion of
concepts expressed in query terms. Retrieved useful references were recorded. The two main tactics used by the subjects were the adding of a conjoined term and the replacement of an existing term with another. The
students were able to translate into query terms only slightly more than half the concepts they identified. The subjects advanced significantly in terms of the Kuhlthau model between their search sessions. Their
conceptual structure was richer and the number of search terms used increased, but the number of references accepted as useful decreased. The proportion of concepts articulated in the query correlated significantly with the number of useful
references found.
|
771
|
Information Science Abstracts: Tracking the Literature of Information Science. Part 2: A New Taxonomy for Information Science
Donald T. Hawkins, Signe E. Larson, and Bari Q. Caton Published online 16 April 2003
Using 3000 abstracts from Information Science
Abstracts (ISA), Hawkins, Larson, and Caton test the validity of a new ISA classification structure for information science, leading to the revision and fine-tuning of the structure. The structure was produced by
collecting terms from available vocabularies and grouping them under 13 main headings. Each abstract was given only one classification number, representing a main heading and a single sub-heading, by each of the researchers. A
review of the distribution of abstracts over sections suggested combining some closely related categories, and the presence of unclassifiable abstracts pointed to gaps in coverage. Only in 19% of the cases did all
three disagree on the assignment of a main heading. A second test with 1265 abstracts showed that the abstracts were well distributed over what were now 11 main sections. Low-posted sub-headings were examined but
retained as growing areas. The taxonomy is included as an appendix.
|
782
|
Improving the Search Environment: Informed Decision Making in the Search for Statistical Information
Stephanie W. Haas Published online 16 April 2003
Studying the Bureau of Labor Statistics' LABSTAT database, Haas looks for searching decision points at which
assistance for the searcher may be of value. A searcher has some measure of both search and domain knowledge and will need some knowledge of the way information is provided to make effective decisions. Transition points
are identified between user vocabulary and Bureau of Labor Statistics (BLS) concepts, between BLS concepts and BLS data and information products, and between these products and the actual query. This suggests the need
for help in concept definition, ambiguity resolution, and synonym usage, which is not uncommon in retrieval systems, but also for assistance in the choice of products through a matrix of specifiable categories with available
surveys and series. The need to express a query for a chosen survey or series suggests a need for variable displays, information on the interaction of variable-value choices, and warnings of unusual situations.
|
798
|
BRIEF COMMUNICATION
Using the Mann-Whitney Test on Informetric Data
John C. Huber and Roland Wagner-Döbler Published online 16 April 2003
Huber and Wagner-Döbler demonstrate a relatively simple procedure for implementing, using a
spreadsheet, a Mann-Whitney test of the difference of two bibliometric samples which will take into account the large number of ties normally present in such data. Sources with the same count of publications are
assigned the same rank, the value being the median of the ranks occupied by such sources in both samples. The lower the p-level, the stronger the evidence that the samples are from different distributions. It is thus possible to
determine if a change in productivity is due to factors beyond the change in number of sources. However, small samples with small differences will appear to be from the same distribution, and larger samples are
necessary to overcome the effect of multiple ties.
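The tie handling described above can be sketched with the textbook midrank form of the Mann-Whitney test. This is not the authors' spreadsheet procedure: the normal approximation with the standard tie correction, the omission of a continuity correction, and the function name are all assumptions made for illustration.

```python
import math
from collections import Counter


def mann_whitney_ties(sample_a, sample_b):
    """Return (U for sample_a, z, two-sided p) using midranks for ties."""
    n1, n2 = len(sample_a), len(sample_b)
    n = n1 + n2
    counts = Counter(sample_a) + Counter(sample_b)

    # Every source with the same publication count shares one midrank.
    midrank = {}
    start = 1
    for value in sorted(counts):
        t = counts[value]
        midrank[value] = start + (t - 1) / 2
        start += t

    rank_sum_a = sum(midrank[v] for v in sample_a)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2

    # Ties shrink the variance of U; t**3 - t is zero for untied values.
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:                        # all observations identical
        return u_a, 0.0, 1.0

    z = (u_a - mean_u) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail
    return u_a, z, p


# Hypothetical publication counts per source in two periods.
u, z, p = mann_whitney_ties([1, 1, 2, 3], [1, 2, 4, 5])
```

With samples this small the normal approximation is rough, which echoes the summary's caution that larger samples are needed when ties are numerous.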
|
| |
Book Review |
802 |
Patents, Citations & Innovations: A Window on the Knowledge Economy, by Adam B. Jaffe and Manuel Trajtenberg. Reviewed by Chaomei Chen Published online 16 April 2003 |
| |
Letters to the Editor |
804 |
"Empirical Evidence of Self-Organization": a rejoinder Loet Leydesdorff |
805 |
Arguments for Epistemology in Information Science Birger Hjørland |
|