Bulletin, June/July 2006

What's New?
Selected Abstracts from JASIST


Authors who choose to do so prepare and submit these summaries to the editor of the Bulletin.

From JASIST v. 57 (3)

Elovici, Y., Shapira, B., & Kantor, P.B. (2006). A decision theoretical approach to combining information filters: Analytical and empirical evaluation, 306-320.

Study and Results: This paper asks how an information professional can make best use of multiple search engines when filtering streams of data. The hypothesis, which is verified, is that different ways of combining the information will be appropriate for users with different value schemes. It is found that different logical fusion rules should be used for different value schemes. The result is verified by experiments using the TREC collection and two different filtering methods.

What's New? Careful application of this approach could improve the value of filtering systems used in support of the financial industry, the intelligence community, and other areas.

Limitations: The work is mathematically rigorous, being an extension of Blackwell's Theorem, but the size of the effect will depend on the specific collection and standing queries.
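
The idea that the best fusion rule depends on the user's value scheme can be illustrated with a minimal sketch. It is not the authors' decision-theoretic method: the judged sample, the rule set, and the `gain` and `cost` parameters below are all hypothetical, chosen only to show how one rule can win under one value scheme and lose under another.

```python
# Hypothetical judged sample: (filter A accepts, filter B accepts, truly relevant).
JUDGED = (
    [(True, True, True)] * 4
    + [(True, False, True)] * 2
    + [(False, True, True)] * 2
    + [(True, True, False)] * 1
    + [(True, False, False)] * 3
    + [(False, True, False)] * 3
    + [(False, False, False)] * 5
)

# Candidate logical fusion rules for two binary filters (an assumed rule set).
RULES = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "A": lambda a, b: a,
    "B": lambda a, b: b,
}


def total_value(rule, judged, gain, cost):
    """Sum the value of every delivered document under one fusion rule.

    gain: value of delivering a relevant document (assumed value scheme).
    cost: penalty for delivering a non-relevant document (assumed value scheme).
    """
    fuse = RULES[rule]
    value = 0
    for a, b, relevant in judged:
        if fuse(a, b):
            value += gain if relevant else -cost
    return value


def best_rule(judged, gain, cost):
    """Return the fusion rule with the highest total value for this scheme."""
    return max(RULES, key=lambda rule: total_value(rule, judged, gain, cost))


if __name__ == "__main__":
    print(best_rule(JUDGED, gain=1, cost=3))  # user who dislikes false alarms
    print(best_rule(JUDGED, gain=3, cost=1))  # user who dislikes misses
```

On this made-up sample, a user who penalizes false alarms heavily is best served by the AND rule, while a user who values every relevant document is best served by the OR rule.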

Asonuma, A., Fang, Y., & Rousseau, R. (2006). Reflections on the age distribution of Japanese scientists, 342-346.

Study and Results: The age distribution of Japanese scientists is investigated to determine whether major non-demographic events such as World War II had an appreciable effect on its features. Contrary to the Chinese situation where the effects of the Cultural Revolution are clearly visible, no such effect was found. Yet, it was found that the baby boom generation, born after World War II, dominates the scientific landscape.

What’s New? It is shown that World War II itself had no influence on university enrollments in Japan. Yet under the influence of the American occupation, Japanese universities were for the first time opened to female students. This effect is clearly visible. Generally, female participation in the scientific and university systems in Japan is low, though slowly increasing.

Limitations: Results of the population census held in the year 2000 were not yet available at the time of writing.

Sun, A., & Lim, E.-P. (2006). Web-unit based mining of homepage relationships, 394-407.

Study and Results: We study the problem of mining the relationships among homepages on the same website. Homepages are usually the main targets for searching and browsing. They, together with relationship instances among them, facilitate semantic-based information retrieval on websites. In this research, we adopt a classification approach in homepage-relationship mining and investigate the features to be used. We identify three types of inter-homepage features, namely, navigation, relative-location, and common-item features. We also propose deriving for each homepage a set of support pages. The homepage together with its support pages is known as a Web unit. Our experiments on the WebKB dataset showed that extracting inter-homepage features from Web units yields better homepage-relationship mining accuracy than using features derived from individual homepages.

What’s New? The problem of homepage-relationship mining is formally defined and three types of inter-homepage features are carefully studied. These features can be derived from either individual homepages or Web units that contain more complete information about the homepages.

Limitations: Experiments were conducted on the WebKB dataset, which is small relative to the size of the Web today. Experiments on larger datasets and datasets from different domains will yield more interesting results.

Yoon, Y., Lee, C., & Lee, G. (2006). An effective procedure for constructing a hierarchical text classification system, 431-442.

Study and Results: Hierarchical classification can provide solutions to effectiveness and efficiency problems of practical text classification tasks. We devised a new evaluation technique applied to internal classifiers (nodes), which guarantees more opportunity of classification to the lower classifiers (nodes or leaves) in the hierarchy, hence improving the overall classification performance. In an experiment that used the 20 Newsgroups and OHSUMED test data collections, our method achieved higher classification accuracy than any of the other methodologies compared.

What’s New? Our method is based on the new evaluation scheme for internal classifiers and is systematic and well defined in its classification procedure. Therefore, it can effectively be applied to practical classification tasks with a very large number of documents and categories. In addition, at the cost of a slight decrease in accuracy, we could reduce the training time dramatically.

Limitations: Our hierarchical classification system adopts the top-down level-based approach in classifying hierarchically, and thus cannot be applied to other hierarchical methods such as the big-bang approach, which determines classes in a single classification run.
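
The top-down level-based approach mentioned above can be sketched as follows. This is only an illustration of routing a document from the root down to a leaf, one level at a time; it does not reproduce the authors' evaluation technique for internal classifiers. The category tree, the keyword sets, and the keyword-overlap scorer standing in for a trained classifier are all hypothetical.

```python
# Hypothetical category tree: each internal node lists its children.
CHILDREN = {
    "root": ["computing", "science"],
    "computing": ["comp.graphics", "comp.hardware"],
    "science": ["sci.med", "sci.space"],
}

# Hypothetical keyword sets standing in for trained per-node classifiers.
KEYWORDS = {
    "computing": {"computer", "software", "image", "disk", "cpu", "rendering"},
    "science": {"patient", "orbit", "treatment", "launch", "study"},
    "comp.graphics": {"image", "rendering", "pixel"},
    "comp.hardware": {"disk", "cpu", "motherboard"},
    "sci.med": {"patient", "treatment", "disease"},
    "sci.space": {"orbit", "launch", "satellite"},
}


def classify_top_down(text):
    """Route a document from the root to a leaf, choosing one child per level."""
    tokens = set(text.lower().split())
    node = "root"
    path = []
    while node in CHILDREN:  # stop once a leaf is reached
        # Pick the child whose keyword set overlaps the document the most.
        node = max(CHILDREN[node], key=lambda child: len(tokens & KEYWORDS[child]))
        path.append(node)
    return path


if __name__ == "__main__":
    print(classify_top_down("the satellite reached orbit after launch"))
    print(classify_top_down("new cpu and disk upgrade"))
```

Because each level commits to a single child, an error at an internal node cannot be corrected further down, which is why the quality of the internal classifiers matters so much in this style of system.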

From JASIST v. 57 (4)

Shiri, A., & Revie, C. (2006). Query expansion behaviour within a thesaurus-enhanced search environment: A user-centred evaluation, 462-478.

Study and Results: The query expansion behavior of end-users interacting with a thesaurus-enhanced search system on the Web was investigated. Thirty searchers – academic staff and postgraduates – at a university were recruited to perform search tasks based on their own genuine information requests. The results indicated that thesauri are capable of assisting end users in the selection of search terms for query formulation and expansion, in particular by providing new terms and ideas. In 50% of the searches where additional terms were suggested from the thesaurus, users stated that they had not been aware of the terms at the beginning of the search. This observation was particularly noticeable in the case of postgraduate students.

What’s New? The main contribution of the study lies in the finding that academic searchers representing various levels of subject knowledge can benefit from thesauri for search term selection and query expansion purposes. The results also have implications for online user education. It is recommended that online and database-searching courses incorporate training on thesaurus-based search options so that users can improve their performance by conducting high-quality searches.

Limitations: This study employed a commercial retrieval system. Therefore, the search states defined to represent typical thesaurus-based search stages were restricted to the features of that system.

Cheung, C.M.K., & Lee, M. K. O. (2006). Understanding consumer trust in Internet shopping: A multidisciplinary approach, 479-492.

Study and Results: A survey of 278 university students in Hong Kong, aged 18-20, found that trustworthiness of Internet vendors (competence, integrity and security control), legal framework and third-party recognition (e.g., TRUSTe, BBBOnline, Verisign and the like) are important factors for trust building in the online environment.

What’s New? The fact that trust is a fundamental component influencing Internet shopping comes as no surprise. This study further synthesizes diverse theories of trust and develops a framework that provides significant explanation of trust and offers important insights into trust formation strategies. The main lesson to be learned is that third-party recognition is the central element in developing trust in the online environment. Internet vendors should affiliate with trusted third-party bodies and acquire the third party’s seal of approval to endorse their security policies. Interestingly, consumers’ propensity to trust does not play any role in building trust. Our findings support the societal culture of trust where propensity to trust is lower for individuals from collectivist cultures (e.g., Hong Kong).

Limitations: Relatively homogeneous student samples were used in this study. Only replicating this study using different sampling units can assess whether the results are applicable to other Internet users in other cultures.

Shen, X., Li, D., & Shen, C. (2006). Evaluating China’s university library websites using correspondence analysis, 493-500.

Study and Results: This paper applies correspondence analysis to analyze five aspects of 15 university libraries in China with data obtained from the Alexa database. The 15 sample websites are classified into three groupings: “connectivity group,” “visits group” and “others group.” This grouping reveals that different library websites focus on different positioning strategies.

What's New? There are many differences between libraries and business companies; however, library websites without a geographic-position advantage become competitive independent individuals in cyberspace. This evaluation found clear differences among them. Moreover, this study is the first to use correspondence analysis in the library field in a way that contributes to library website construction. We also found that building China’s own website data warehouse is increasingly important.

Limitations: There are limitations in the data provided by Alexa. For example, the Alexa toolbar presently supports only Internet Explorer, and thus websites whose visitors use non-IE browsers will be underrepresented in its calculations. Also, some other evaluation standards such as authority are not included in the evaluation model.

From JASIST v. 57 (6)

Wu, Y.-F. B., Li, Q., Bot, R.S., & Chen, X. (2006). Finding nuggets in documents: A machine learning approach, 740-752.

Study and Results: Document keyphrases are highly useful; they can be used as metadata for documents or to develop a glossary. This paper describes a Keyphrase Identification Program (KIP). The logic of our algorithm is that the more keywords a candidate phrase contains and the more significant these keywords are, the more likely this candidate phrase is a keyphrase. KIP has a system glossary database storing prior positive samples of human-identified phrases, which are used to assign weights to the candidate phrases. The evaluation results show that KIP has better performance than the systems we compared it to.

What's New? Besides KIP’s methodology, this paper also introduces KIP’s two other unique features: the learning function, which can enrich the system glossary database by automatically adding newly identified keyphrases, and the personalization feature, which can help a user build a glossary database specifically tailored to his/her area of interest.

Limitations: The coverage of the prior positive samples of human-identified inputs influences the performance. However, enabling the learning function can rectify deficiencies in the samples.
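
The scoring logic described above (more keywords, and more significant keywords, make a candidate phrase more likely to be a keyphrase) can be sketched roughly as below. The summary does not give KIP's actual weighting formula, so summing per-keyword weights is an assumption, and the weights and candidate phrases are hypothetical stand-ins for the system glossary database.

```python
# Hypothetical keyword weights, standing in for weights derived from the
# glossary of human-identified phrases (higher means more significant).
KEYWORD_WEIGHTS = {
    "information": 6,
    "retrieval": 9,
    "query": 7,
    "expansion": 5,
}


def score_phrase(phrase, weights):
    """Score a candidate phrase as the sum of its known keyword weights.

    Summation is an assumed combination rule: it rewards phrases that
    contain more keywords and keywords with higher weights.
    """
    return sum(weights.get(word, 0) for word in phrase.lower().split())


def rank_candidates(candidates, weights):
    """Order candidate phrases from most to least likely keyphrase."""
    return sorted(candidates, key=lambda p: score_phrase(p, weights), reverse=True)


if __name__ == "__main__":
    candidates = ["last week", "query expansion", "information retrieval"]
    for phrase in rank_candidates(candidates, KEYWORD_WEIGHTS):
        print(phrase, score_phrase(phrase, KEYWORD_WEIGHTS))
```

A phrase made up entirely of words missing from the weight table scores zero, which mirrors the limitation noted above: performance depends on how well the prior samples cover the domain.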

Chen, L., Zeng, J., & Tokuda, N. (2006). A "stereo" document representation for textual information retrieval, 768-774.

Study and Results: Human experience, which shows that stereo audio and video have always been beneficial in acoustic and visual recognition, leads us to believe that perceiving objects from two or more perspectives is always an advantage. The purpose of this article is to discover whether "stereo" views of a textual object (i.e., a document) are helpful for information retrieval purposes. Experiments on two standard corpora have illustrated that both the standard term-vector method and the latent-semantic-indexing method are able to achieve significant improvements by adopting the stereo representation model.

What’s New? Paralleling stereo audio and video, the concept of "stereo" document representation is proposed. Although this concept is only used in limited experiments in this article, we expect it to be applied as a general principle to many textual retrieval approaches.

Limitations: Experiments on large corpora will be necessary for further verification of the observed improvements. For an arbitrary corpus, it is not yet clear what constitutes a general condition for an optimal overlapping rate for multiple "stereo" views of each document.