Bert Boyce


Web Search Strategies and Approaches to Studying
Nigel Ford, David Miller, and Nicola Moss
Published online 25 February 2003

In this issue Ford, Miller, and Moss use 68 volunteers from a population of 250 Master's students to complete three web search tasks with clear fact-based goals and three or fewer facets. One task required broadening the search concepts from those given, a second provided specific terminology for one facet but required a second facet that needed translation, and the third required a general-to-specific transformation. The students completed Entwistle's Revised Inventory of Approaches to Studying, which provided values for ten study variables, and were asked to assess their experience with the Internet, with Alta Vista, and with Boolean search. Searches were conducted on Alta Vista using Netscape Navigator 4, with participants free to choose and switch among Boolean, best match, or combined search modes at will while a front-end script recorded all submitted searches and help accesses. Search-related variables were extracted from Boolean-only queries, best-match-only queries, and combined queries. Factor analyses were conducted on all variables for each search mode for each search. In task 1, Boolean search is differentiated from best match search by sharing high loadings with active interest, intention to reproduce, fear of failure, and relating ideas. The combined searcher is linked with the best match searcher, with low active interest, low intention to reproduce, and low fear of failure. In task 2, Boolean search is differentiated from best match search by loading high on intention to reproduce and low on intention to understand. Best match loads positively with intention to understand and negatively with intention to reproduce. Combined searching is linked with both good and poor time management. In task 3 the loadings mimic those of task 1. It seems that Boolean search is consistently linked to a reproductive rather than a meaning-seeking approach, but also with high levels of interest and fear of failure. Best match is associated with the converse of these measures.

Three Target Document Range Metrics for University Web Sites
Mike Thelwall and David Wilkinson
Published online 25 February 2003

Thelwall and Wilkinson use crawls of university web sites in the UK, Australia, and New Zealand to collect all links targeted at same-country university web sites, which they then use to create a graph structure for study. Using Broder's study as a model, they identify a strongly connected component (SCC), in which one can start at any page and reach every other page in the set, and an Out component, whose pages can be reached from every page in the SCC but provide no path of links back to that set. The other components in the Broder model are not accessible without access to a major search engine database. Inlink and outlink counts for all three university systems, in both the Out and SCC components, display a linear pattern when graphed logarithmically, which would indicate that power laws, and a success-breeds-success phenomenon, are generally in effect. However, automatically generated pages, non-HTML web pages, and large resource-driven sites were all associated with anomalies in this pattern.
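
The two components described above can be illustrated with a minimal sketch. It is not the authors' crawler or analysis code: the toy link graph, the page names, and the helper functions `reachable`, `reverse`, and `scc_and_out` are all hypothetical. The sketch assumes a seed page known to lie in the SCC, takes the SCC as the pages that both reach and are reached from that seed, and takes the Out component as the pages reachable from the seed that have no path back.

```python
from collections import deque


def reachable(start, edges):
    """Return the set of pages reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in edges.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def reverse(edges):
    """Return the link graph with every link direction flipped."""
    flipped = {}
    for source, targets in edges.items():
        for target in targets:
            flipped.setdefault(target, []).append(source)
    return flipped


def scc_and_out(seed, edges):
    """Split the pages around `seed` into its SCC and the Out component."""
    forward = reachable(seed, edges)             # pages the seed can reach
    backward = reachable(seed, reverse(edges))   # pages that can reach the seed
    scc = forward & backward
    out = forward - scc
    return scc, out


# Hypothetical toy link graph: a, b, c link in a cycle; d and e only receive links.
links = {
    "a": ["b"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": ["e"],
    "e": [],
}

scc, out = scc_and_out("a", links)
print(sorted(scc))  # ['a', 'b', 'c']
print(sorted(out))  # ['d', 'e']
```

On a real crawl the seed would have to be chosen with care, since a seed outside the largest strongly connected set would yield a different, smaller component.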

Searching for Images: The Analysis of Users' Queries for Image Retrieval in American History
Youngok Choi and Edie M. Rasmussen
Published online 25 February 2003

Choi and Rasmussen collect queries to the Library of Congress's American Memory photo archive from 48 scholars in American History by way of interviews and pre- and post-search questionnaires. Their interest is in the types of information need common in the visual domain, and in the categories of terms most often used or indicated as appropriate for the description of image contents. Each search resulted in the provision of 20 items for evaluation by the searcher. Terms in queries and acceptable retrievals were categorized by a who, what, when, where faceted classification, and queries were assigned to four needs categories: specific, general, abstract, and subjective. For all 38 requests at least two of the three analysts assigned the request to the same one of the four categories, and in 19 cases all three agreed. General/nameable needs accounted for 60.5%, specific needs for 26.3%, general/abstract needs for 7.9%, and subjective needs for 5.3%. The facet analysis indicated that most content was of the form person/thing or event/condition limited by geography or time.

Information as Commodity and Economic Sector: Its Emergence in the Discourse of Industrial Classification
Cheryl Knott Malone and Fernando Elichirigoity
Published online 25 February 2003

Malone and Elichirigoity review the concept of "information" as it exists in the North American Industry Classification System (NAICS), implemented in 1997, the current scheme for the organization of governmental data about the economies of the U.S., Canada, and Mexico. The term represents one of 20 major economic sectors, based upon processes of production, upon which data may be reported. It also represents a measurable commodity based upon the concept of copyright. A review of the background studies and reports that document the development of NAICS shows the desire for a single underlying principle, similarity of production processes rather than a marketing approach, and the construction of the information sector within the context of globalization and the internet. The three nations agreed in 1996 that the information sector should consist of industries engaged in the "transformation of information into a commodity that is produced, manipulated and distributed...," or as the NAICS manual states, industries that "primarily create and disseminate a product subject to copyright." However, industries that transfer or transport such products are also included, which seems inconsistent with the production principle. In 2002 the category was modified to separate internet publishing and broadcasting from these subcategories and to create an internet services category.

A Method for the Comparative Analysis of Concentration of Author Productivity, Giving Consideration to the Effect of Sample Size Dependency of Statistical Measures
Fuyuki Yoshikane, Kyo Kageura, and Keita Tsuji
Published online 25 February 2003

Studies of the concentration of author productivity based upon counts of papers by individual authors will produce measures that change systematically with sample size. Yoshikane, Kageura, and Tsuji seek a statistical framework that will avoid this scale effect problem. Using the number of authors in a field as an absolute concentration measure, and Gini's index as a relative concentration measure, they describe four literatures from both viewpoints with measures insensitive to one another. Both measures will increase with sample size. They then plot profiles of the two measures on the basis of a Monte Carlo simulation of 1000 trials for 20 equally spaced intervals and compare the characteristics of the literatures. Using data from conferences hosted by four academic societies between 1992 and 1997, they find a coefficient of loss exceeding 0.15, indicating that the measures will depend highly on sample size. The simulation shows that a larger sample size leads to lower absolute concentration and higher relative concentration. Comparisons made at the same sample size present quite different results than the original data and allow direct comparison of population characteristics.
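
The sample size dependency described above can be illustrated with a rough sketch. It is not the authors' procedure. It assumes a population stored as one author label per authorship, draws random subsamples without replacement at 20 equally spaced sizes, and averages the number of distinct authors and Gini's index over 1000 trials. The population itself, the seed, and the function names `gini` and `profile` are hypothetical.

```python
import random
from collections import Counter


def gini(counts):
    """Gini's index of a collection of non-negative paper counts."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def profile(authorships, sizes, trials=1000, seed=0):
    """Average author count and Gini's index for each sample size."""
    rng = random.Random(seed)
    rows = []
    for size in sizes:
        author_total = 0.0
        gini_total = 0.0
        for _ in range(trials):
            sample = rng.sample(authorships, size)
            papers_per_author = Counter(sample).values()
            author_total += len(papers_per_author)
            gini_total += gini(papers_per_author)
        rows.append((size, author_total / trials, gini_total / trials))
    return rows


# Hypothetical skewed population: author i holds max(1, 60 // i) authorships.
authorships = []
for i in range(1, 61):
    authorships.extend([f"author{i}"] * max(1, 60 // i))

# 20 equally spaced sample sizes, ending at the full population.
sizes = [len(authorships) * k // 20 for k in range(1, 21)]
rows = profile(authorships, sizes)

for size, mean_authors, mean_gini in rows:
    print(f"{size:4d}  authors={mean_authors:6.2f}  gini={mean_gini:.3f}")
```

Reading the two averaged columns at a common sample size is what permits the kind of like-for-like comparison between literatures that the summary describes.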

Incorporating User Search Behavior into Relevance Feedback
Ian Ruthven, Mounia Lalmas, and Keith van Rijsbergen
Published online 25 February 2003

Ruthven, Lalmas, and van Rijsbergen rank and select terms for query expansion using information gathered on searcher evaluation behavior. Using the TREC Financial Times and Los Angeles Times collections and search topics from TREC-6 placed in simulated work situations, six student subjects each performed three searches on an experimental system and three on a control system, with instructions to search by natural language expression in any way they found comfortable. Searching was analyzed for behavior differences between experimental and control situations, and for effectiveness and perceptions. In three experiments paired t-tests were the analysis tool, with the controls being a system with no relevance feedback, a system with standard ranking for automatic expansion, and a system with standard ranking for interactive expansion, while the experimental systems based ranking upon user information on temporal relevance and partial relevance. Two further experiments compare the use of user behavior (the number of documents assessed relevant and the similarity of relevant documents) to choose a query expansion technique against a non-selective technique, and finally examine the effect of providing the user with knowledge of the process. When partial relevance data and time of assessment data were incorporated in term ranking, more relevant documents were recovered in fewer iterations; however, overall retrieval effectiveness was not improved. The subjects nonetheless rated the suggested terms as more useful and used them more heavily. Explanations of what the feedback techniques were doing led to higher use of the techniques.

Requirements for a Cocitation Similarity Measure, with Special Reference to Pearson's Correlation Coefficient
Per Ahlgren, Bo Jarneving, and Ronald Rousseau
Published online 25 February 2003

Ahlgren, Jarneving, and Rousseau review accepted procedures for author co-citation analysis, first pointing out that since in the raw data matrix the row and column values are identical, i.e., the co-citation count of two authors, there is no clear choice for diagonal values. They suggest using the number of times an author has been co-cited with himself, excluding self-citation, rather than the common treatment of diagonal values as zeros or as missing values. When the matrix is converted to a similarity matrix, the normal procedure is to create a matrix of Pearson's r coefficients between data vectors. Ranking by r, by co-citation frequency, and by intuition can easily yield three different orders. It would seem necessary that adding zeros to the matrix should not affect the value or the relative order of similarity measures, but it is shown that this is not the case with Pearson's r. Using 913 bibliographic descriptions from the Web of Science of articles from JASIS and Scientometrics, authors' names were extracted and edited, and 12 information retrieval authors and 12 bibliometric authors, each from the top 100 most cited, were selected. Co-citation and r value (diagonal elements treated as missing) matrices were constructed, and then reconstructed in expanded form. Adding zeros can change both the r value and the ordering of the authors based upon that value. A chi-squared distance measure would not violate these requirements, nor would the cosine coefficient. It is also argued that co-citation data are ordinal data since there is no assurance of an absolute zero number of co-citations, and thus Pearson's r is not appropriate. The number of ties in co-citation data makes the use of the Spearman rank order coefficient problematic.
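
The effect of added zeros can be seen in a small worked sketch. The two co-citation vectors below are hypothetical toy values, not data from the article, and the padding assumes six further authors who are never co-cited with either of the two authors being compared. Pearson's r changes sign under the padding, while the cosine coefficient is unchanged.

```python
import math


def pearson(x, y):
    """Pearson's r between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cosine(x, y):
    """Cosine coefficient between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical co-citation counts of two authors with four other authors.
x = [3, 1, 4, 2]
y = [1, 3, 2, 4]

# Expand the matrix with six authors co-cited with neither (all zeros).
x_padded = x + [0] * 6
y_padded = y + [0] * 6

print(round(pearson(x, y), 4))                # -0.6
print(round(pearson(x_padded, y_padded), 4))  # 0.6
print(round(cosine(x, y), 4))                 # 0.7333
print(round(cosine(x_padded, y_padded), 4))   # 0.7333
```

The zeros pull both means down, so the shared zero entries count as agreement for Pearson's r; the cosine coefficient depends only on the dot product and the vector norms, which zero entries leave untouched.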

Modeling the Information-Seeking Behavior of Social Scientists: Ellis's Study Revisited
Lokman I. Meho and Helen R. Tibbo
Published online 25 February 2003

Meho and Tibbo show that the Ellis model of information seeking applies to a web environment by way of a replication of his study, in this case using the behavior of social science faculty studying stateless nations, a group diverse in skills, origins, and research specialities. Data were collected by way of e-mail interviews. Material on stateless nations was limited to papers in English on social science topics published between 1998 and 2000. Of these, 251 had 212 unique authors who were identified as academic scholars and for whom sufficient information was available to provide e-mail addresses. Of the 139 whose addresses were located, 9 who were physically close were reserved for face-to-face interviews, and of the remainder 60 agreed to participate and responded to the interview of 25 open-ended questions. Follow-up questions generated a 75% response. Of the possible face-to-face interviewees, five agreed to participate and provided 26 thousand words, as opposed to 69 thousand from the 45 e-mail participants. The activities of the Ellis model are confirmed, but four additional activities are also identified. These are accessing, i.e., finding the material identified in indirect sources of information; networking, or the maintaining of close contacts with a wide range of colleagues and other human sources; verifying, i.e., checking the accuracy of new information; and information managing, the filing and organizing of collected information. All activities are grouped into four stages: searching, accessing, processing, and ending.


Electronic Collection Development: A Practical Guide, by Stuart D. Lee
Reviewed by Marianne Afifi
Published online 25 February 2003



Beyond Our Control? Confronting the Limits of Our Legal System in the Age of CyberSpace, by Stuart Biegel
Reviewed by Kenneth Einar Himma
Published online 25 February 2003


Economic Growth in the Information Age, by Dale W. Jorgenson
Reviewed by John Cullen
Published online 25 February 2003

