Bulletin, October/November 2006

What’s New? Selected Abstracts from JASIST

Authors who choose to do so prepare and submit these summaries to the editor of the Bulletin.

From JASIST v. 57 (7)
Wang, J. (2006). Automatic thesaurus development: Term extraction from title metadata, 907-920.
Study and Results: Titles of scientific and technical publications usually contain rich subject-bearing terms, reflecting ever-advancing domain knowledge. In metadata records each title is described with subject headings, the controlled vocabulary drawn from thesauri. This paper explores a novel thesaurus-updating mechanism that discovers new terms in titles and adds them to the thesaurus as entry vocabulary. It does so by analyzing the associations between the terms extracted from bibliographic titles and the subject descriptors in the metadata records. A core concept is generalized from the associated subject descriptors, and the extracted term is mapped to it as a narrower or equivalent term. The effectiveness of the method was demonstrated by an experiment on the Chinese Classified Thesaurus and China MARC records.

What’s New? The approach proposed here can be used as a mechanism for terminology discovery and for facilitating thesaurus revision. The most promising application is automatic indexing: finding substantive keywords in the titles to be cataloged, then mapping them to thesaurus terms.

Limitations: This approach requires that titles reflect the subject matter of the content. This is not true for art and literature publications.

Conrad, J.G., & Shriber, C.P. (2006). Managing déjà vu: Collection building for the identification of non-identical duplicate documents, 921-932.
Study and Results: As online document collections continue to expand, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close variants. This work investigates the phenomenon of near duplicates and algorithmic approaches to minimizing it. The authors enlist practitioners in their research and development framework, from the front end (via user representatives) to the back end (via professional assessors), and harness metrics of consistency, such as the Kappa statistic, to quantify their contribution. By applying this variety of domain expertise, the authors effectively characterize the duplication existing in the large textual collections that they service. They validate the completeness and reliability of this effort with analyses of assessor agreement, error rates and significance. The resultant test collection ultimately helped to produce an algorithm that captures 97% of the duplicates identified by human experts.
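To illustrate how a Kappa statistic quantifies assessor consistency, here is a minimal sketch. It assumes Cohen's kappa for two assessors; the abstract does not say which Kappa variant the authors used, and the function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors who labeled the same items.

    Assumption: two assessors, nominal labels (e.g. "dup" / "not").
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both assessors must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each assessor's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both assessors used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments on six document pairs.
assessor_1 = ["dup", "dup", "not", "not", "dup", "not"]
assessor_2 = ["dup", "not", "not", "not", "dup", "not"]
kappa = cohen_kappa(assessor_1, assessor_2)
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance.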

What’s New? Besides the principled methods this work establishes and implements, it also makes another novel and meaningful contribution to the research community. It creates a deduping test collection by harnessing (a) real user queries, (b) a massive collection from an operational setting and (c) professional assessors possessing substantial knowledge of the domain and its clients. In addition, the work expands the discussion of online (real-time) deduping. [The test collection developed for this project is available for non-commercial research purposes upon request from the authors.]

Limitations: The authors’ principal test collection was developed around a large set of query-generated news documents, so the extent to which it may extend to more specialized documents remains an open question. Although the resultant document set serving as the deduplication collection is not massive, the procedures used to generate such a test collection in operational settings have proven to be reliable and reproducible.

From JASIST v. 57 (8)
Larivière, V., Archambault, É., Gingras, Y., & Vignola-Gagné. (2006). The place of serials in referencing practices: Comparing natural sciences and engineering with social sciences and humanities, 997-1004.
Study and Results: Using the references made by all articles indexed in the CD-ROM versions of the SCI, SSCI and AHCI databases from 1981 to 2000, this paper quantifies the share of citations made to serials and to other types of literature. We show that journal literature is increasingly important in the natural and social sciences, but that its role in the humanities is stagnant and even tended to diminish slightly in the 1990s. Since journal literature accounts for less than 50% of the citations in several disciplines of the social sciences and humanities, special care should be taken when using bibliometric indicators that rely only on journal literature.

What’s New? Previous studies of publication practices are mostly limited to either national or institutional levels or represent only a static view of publication practices. The importance of journal literature in the various scientific fields has therefore not been systematically characterized, nor has its evolution over time been analyzed. Our paper addresses these issues by providing a systematic measurement of the role played by journal literature in both the natural sciences and engineering and the social sciences and humanities.

Limitations: Data for this paper are drawn from journal literature; they might not be representative of other types of literature. 

Ercegovac, Z. (2006). Multiple-version resources in digital libraries: Towards user-centered displays, 1023-1032.
Study and Results: We asked (i) what characteristics of bibliographic relationships are found in a sample of entities in OCLC’s WorldCat, and (ii) what capability current cataloging standards have for expressing bibliographic entities and relationships. Descriptive survey methodology was used to manually examine 86 cataloging entries in a sample of Abbott’s science fiction work Flatland, in an attempt to transform current lists of cataloging entries into hierarchical displays based on the FRBR (Functional Requirements for Bibliographic Records) entity-relationship (E-R) model. Of the three groups of entities that this model proposes, we applied the Group 1 hierarchy of entities (work, expression, manifestation, item) to the sample of cataloging entries. Since the display of entries depends on current cataloging rules (AACR2r) and the existing MARC format, neither of which is based on the FRBR model, manifestations did not collocate under expressions in our experiment. In contrast to FRBR, many library catalogs display entries in a “last in, first out” sort, which is not exactly what searchers would expect.

What’s New? The study’s findings might shed light on a navigational capability in networked digital libraries: assembling bibliographic records into interrelated clusters and displaying these according to the FRBR entity-relationship model. Instead of having to browse through multiple screens of entries, the user is informed about the bibliographic landscape on a single screen, which can then be viewed hierarchically along labeled arcs and nodes. This study found that a small number of parents populated nearly 60 percent of the English-language books in the sample. This finding is used to compact results into readily identifiable clusters of entries that searchers could explore further.

Limitations: This case study uses a sample of one particular work (Abbott’s Flatland), and the results obtained cannot be generalized to other science fiction works. However, the Bradford power law has again been validated for a work that is not as popular as those written by Dickens, Shakespeare, Zola, Tolstoy, Dickinson and others.

Bates, M. J. (2006). Fundamental forms of information, 1033-1045.
Study and Results: This was not an empirical study; rather, information is defined at a fundamental level; then, building on that basis, several related forms of information are defined and discussed.

What’s New? Everyone disagrees on what information is. This is an effort to define it so fundamentally that further definitions in information science can build upon this basic one. The article starts with information as the pattern of organization of matter and energy. It is these patterns of organization that animals and humans detect and give meaning to, whether we discern a chair before us or learn the population of Turkey.

Further fundamental forms of information defined and discussed are the following: natural, represented, encoded, embodied, experienced, enacted, expressed, embedded, recorded and trace information. Ways to use these concepts in the study of information seeking behavior and information genres are discussed. Finally, drawing on these fundamental forms, a distinction is made between information sciences and curatorial sciences.

Limitations: These are one person’s ideas; the ultimate test is whether we in this field find these terms useful in theory and practice.

Hupfer, M., & Detlor, B. (2006). Gender and Web information seeking: A self-concept orientation model, 1105-1115.
Study and Results: Rather than assuming the equivalence of sex and gender, the self-concept orientation model specifies the measurement of gender-related self-concept traits known as self- and other-orientation. Survey research found that these traits interact to predict how often individuals search for information that is important to them personally versus relevant to someone close. For highly self-oriented users, search for both self- and other-relevant information depended on their other-orientation level. High-self/high-other individuals, with a comprehensive processing strategy, searched most often while high-self/low-other respondents, with an effort minimization strategy, searched the least.

What’s New? If traditional gender distinctions continue to erode, information-seeking models that incorporate individual differences will be increasingly useful. Our results also indicate that practitioners should consider asking users to complete a self- and other-orientation personality quiz before searching. Then high-self/high-other users could receive tips for streamlining search, while high-self/low-other users could be prompted to broaden their information seeking.

Limitations: Most of the study participants were undergraduate students who provided self-reports of searching frequency.

From JASIST v. 57 (9)
Vaughan, L. (2006). Visualizing linguistic and cultural differences using Web co-link data, 1178-1193.
Study and Results: The study examined Web co-links to Canadian university websites. Co-link data were collected in ways that would reflect three different views: the global view, the French Canada view and the English Canada view. Visualization techniques were applied to the three data sets. The results accurately reflected the ways Canadians see the universities and clearly showed the linguistic and cultural differences within Canadian society.
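As a rough illustration of what co-link data are, the sketch below counts, for each pair of target sites, how many pages link to both. It assumes the outlinks of each linking page are already available; the abstract does not describe how the study gathered its data, and the function name, page identifiers and `.example` domains are hypothetical.

```python
from collections import Counter
from itertools import combinations


def colink_counts(outlinks, targets):
    """Count the pages that link to both sites in each pair of target sites.

    outlinks: dict mapping a linking page to the sites it links to (assumed input).
    targets:  the set of sites whose co-links are of interest.
    """
    counts = Counter()
    for linked_sites in outlinks.values():
        # Keep only the target sites this page links to, once each, in a stable order.
        hits = sorted(set(linked_sites) & set(targets))
        # Every pair of targets linked from the same page is one co-link.
        for site_a, site_b in combinations(hits, 2):
            counts[(site_a, site_b)] += 1
    return counts


# Hypothetical data: three linking pages and three university sites.
pages = {
    "page1": ["a.example", "b.example"],
    "page2": ["a.example", "b.example", "c.example"],
    "page3": ["c.example", "other.example"],
}
universities = {"a.example", "b.example", "c.example"}
matrix = colink_counts(pages, universities)
```

A matrix of such pair counts is the kind of input to which visualization techniques can then be applied.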

What’s New? Results from the study showed that Web co-linking is not a random phenomenon, but rather it reflects social relationships. It is conceivable that co-link analysis of universities in different countries can provide useful information on academic, cultural, political or social ties among them. The methodology used in this study can be extended to investigate entities other than universities, such as governments or international organizations. More importantly, regular data collection for a group of websites can allow analysis of changing relationships over time.

Limitations: University websites from only one country were examined; therefore the applicability of the conclusions to other countries is unknown.

Yi, K., Beheshti, J., Cole, C., Leide, J. E., & Large, A. (2006). User search behavior of domain-specific IR systems: An analysis of the query logs from PsycINFO and ABC-Clio’s Historical Abstracts/America: History and Life, 1194-1207. 
Study and Results: This study identified subject-specific searching patterns in the selection of terms for user queries, based on an analysis of the query logs of two subject-bound IR systems: PsycINFO for psychology and ABC-Clio for history. The results showed that the patterns of query term selection differed between PsycINFO and ABC-Clio, indicating the importance of subject domain in determining user searching behavior. PsycINFO users tended to select terms corresponding to the concept-based descriptors of PsycINFO’s classification system, while ABC-Clio users tended to select query terms that contained specific instances of regions, people and historical events.

What’s New? The main contribution of this study is empirical evidence that searching behavior on subject-bound, reference (or scholarly) IR systems is distinct from searching behavior on general Web search engines like Excite. At the same time, the huge popularity of general Web search engines may have affected searching on reference search engines, encouraging shorter queries than in the 1990s. The study provides a technique for analyzing query logs based on multi-word terms (MWT), instead of the usual single word, word pair or term co-occurrence analysis techniques. The MWT technique provides search engine developers with a methodology for (i) directing their users to system thesauri for greater searching efficiency and (ii) monitoring the evolution of a domain for thesaurus or descriptor updating.

Limitations: Because of the rapid evolution of the system interfaces analyzed, the study only briefly considered the effect of system interface design on influencing user selection of query terms.

From JASIST v. 57 (10)
Pan, B., Gay, G.K., Saylor, J.M., & Hembrooke, H.A. (2006). One digital library, two undergraduate classes, and four learning modules: Uses of a digital library in classrooms, 1315-1325.
Study and Results: This research studied the uses of a digital library in one undergraduate geometry class and one undergraduate robotics class. KMODDL (Kinematic Models for Design Digital Library) is based on a historic collection of steel and bronze models. The digital library contains textual materials, QuickTime Virtual Reality movies, Java simulations and stereolithographic files of the physical models. Through a multi-method approach, including surveys, ethnographic observations, interviews, document analysis, videotaping, software logging and screen capturing, this research revealed that the users in different classes encountered different usability problems and reported quantitatively different subjective experiences. Depending on the subject area, the two user groups preferred different types of modules, resulting in different uses of the materials and different learning outcomes.

What’s New? This research reports on the dynamic development and uses of a digital library in different educational settings.

Limitations: The applicability of the results to other types of digital libraries is limited, since only one specific digital library was studied in this research.