A Comparative Approach to System Evaluation: Delegating Control of Retrieval Tests to an Experimental Online System

Karen M. Drabenstott
Associate Professor
School of Information
University of Michigan
Ann Arbor, MI 48109-1092

Marjorie S. Weller
Programmer Analyst
Medical Center Information Technologies
University of Michigan
Ann Arbor, MI 48109-0726

Abstract

The paper describes the comparative approach to system evaluation used in this research project which delegated the administration of an online retrieval test to an experimental online catalog to produce data for evaluating the effectiveness of a new subject access design. The catalog ushered users through a tightly controlled sequence of events and performed and/or recorded all activity connected with data collection including searcher recruitment, questionnaire administration, and logging of users' searches and relevance assessments. In the absence of human intermediaries to monitor ongoing searches, system users were free to complete as little or as much of the test administration as they wanted as well as enter queries and responses that were unsuitable to the search administration. This paper describes the methods that the researchers enlisted to sort out "problem" test administrations, for example, to identify out-of-scope queries, incomplete system administrations, and suspect post-search questionnaire responses. It also covers how the researchers handled "problem" search administrations and what actions they would use to reduce or eliminate the occurrence of such administrations in future online retrieval tests that delegate control of retrieval tests to online systems.

Introduction

Comparative approaches to system evaluation have used human intermediaries to collect information about searchers and their searching experiences. Such approaches are expensive and time-consuming for several reasons. Human intermediaries are needed to recruit searchers for participation in the system evaluation. Interviewers must be present to monitor end-user searches, ask questions, and record answers. If searchers are not recruited and scheduled ahead of time, interviewers could spend a considerable amount of time recruiting searchers themselves and explaining the study to them. Travel expenses for interviewers can be considerable for extended data-collection periods at off-campus sites. For these reasons, the project director used a comparative approach to system evaluation in this project that did not use interviewers or human intermediaries at all. The experimental online catalog performed and/or recorded all activity connected with data collection - searcher recruitment, questionnaire administration, search logging, and logging of searcher relevance assessments.

The comparative approach used in this study was innovative and experimental. Although online questionnaire administration was used in an early online catalog use study, delegating complete control of retrieval tests to the experimental online catalog had not been done before.

The purpose of this paper is to describe the comparative approach to system evaluation undertaken in a Department of Education-sponsored research project, the administration of an online retrieval test to produce data to evaluate the effectiveness of a new subject access design for online catalog searching, efforts to ensure the reliability of collected data, and strategies that researchers can use to improve on the reliability of data in future online retrieval tests that delegate control of retrieval tests to online systems.

Previous Studies Comparing Systems

The comparative method used in this project is based on the Comparison Search Experiment devised by Siegel et al. (1983) in their evaluation of two prototype online catalogs at the National Library of Medicine (NLM). The goal of Siegel's Comparison Search Experiment was to reproduce realistic search conditions in which library patrons performed a search on a topic of their own choosing on two different systems. The researchers alternated the first system patrons used to minimize any transfer effect, that is, bias caused by patrons transferring their knowledge of the search and of searching in the first system to the second system. The researchers observed patrons and decided when they should switch systems. After patrons completed their search of the second system, the researchers asked them questions about their experiences using the two systems, satisfaction with search results, and system preference.

Several researchers have based their evaluation of two or more online catalogs on Siegel's approach. Markey and Demeyer (1986, 109-18) used Siegel's methods to evaluate two experimental online catalogs, one with enhanced subject searching functionality using the Dewey Decimal Classification (DDC) and one without the enhancement. Jones (1988) enlisted the Comparison Search Experiment to compare the performance of the operational LIBERTAS online catalog to the experimental Okapi online catalog. Walker and Jones (1987) enlisted Siegel's approach to evaluate three different Okapi representations that were alike in all ways except for differences in how each representation handled stemming, synonyms and cross referencing, and spelling correction. Researchers evaluated three different Okapi representations that used classification as a recall-improving device and devised a method that "lay somewhere between the Comparison and Sample search experiments used in the NLM experiment" to compare the three systems with regard to effectiveness, efficiency, and user acceptability (Walker and De Vere 1990, 46). Hildreth (1993) employed an approach similar to the 1990 Okapi evaluation to evaluate the usability and retrieval performance of different navigational approaches to subject searching in three different online catalog representations.

Online Questionnaire Administration in Online Catalog Use Studies

The earliest study of online catalog use that enlisted online questionnaire administration was conducted by researchers at the University of California's Division of Library Automation (UC/DLA) who administered online questionnaires to users of the MELVYL online catalog in the Council on Library Resources-sponsored nationwide study of online catalogs (University of California 1982). Despite the success of UC/DLA's online questionnaire administration, this approach has been rarely used in online catalog use studies.

In a Department of Education-sponsored study, a research team at Rutgers University recorded searches and responses to online pre- and post-search questionnaires in transaction logs for the purpose of discovering the goals, tasks, and behaviors of library and online catalog users (Belkin et al. 1990).

A research team at City University collected data on online catalog use using methods compatible with those enlisted by the Rutgers team (Hancock-Beaulieu, McKenzie, and Irving 1990). The team developed Olive, a front-end microcomputer program that recorded searches and pre-, in-, and post-search online questionnaire responses; Olive also featured search playback and printing capabilities.

While the earliest comparative studies discussed in the previous section enlisted manual methods to collect data, later studies (Walker and Jones 1987; Walker and De Vere 1990, 46; Hildreth 1993) involved some test administration by the experimental system in the form of logged searches and/or user relevance assessments for retrieved titles. The comparative study described in this paper delegated complete control of test administrations - from the recruitment of end users to logging post-search questionnaire responses - to the experimental system.

Experimental Online Catalog Development

The experimental online catalog named ASTUTE (A Search Tree Underlying the Experiment) was developed by a project team at the University of Michigan, Ann Arbor, to test the new subject-access design. The team programmed ASTUTE on a stand-alone Gateway 2000 486, 33 MHz, IBM-compatible microcomputer, with 8 megabytes of RAM and a VGA color monitor. The operating system was MS-DOS version 5.0. A dot-matrix printer and a mouse were attached to the microcomputer for use by ASTUTE project staff during development work and end users during online retrieval tests.

The two separate databases of the ASTUTE experimental online catalog were created from two data sources: (1) machine-readable cataloging (MARC) records for bibliographic data from the two participating libraries in selected subject areas of the Library of Congress Classification (LCC), and (2) MARC records for subject-authority data from the compact disk-based product CD/MARC Subjects distributed by the Library of Congress. The number and subject areas of MARC bibliographic records were:

  1. Mardigian Library of the University of Michigan-Dearborn: 14,686 bibliographic records in Computer Science (QA76's) and Technology (T-TX).
  2. Lilly Library of Earlham College: 11,976 bibliographic records in American History (E1-F1199).

Comparison Search Experiment

Recruiting searchers.

The researchers installed the Gateway microcomputer bearing ASTUTE at the two data collection sites - Mardigian Library at the University of Michigan-Dearborn and Lilly Library at Earlham College. The microcomputer was dedicated to use of the ASTUTE experimental online catalog. The ASTUTE experimental online catalog performed recruiting functions on its own. The presence of the microcomputer equipment in a public place in the library, posted written signs, and two alternating introductory screen savers attracted potential respondents. ASTUTE was installed in Mardigian and Lilly Libraries for several weeks at a time. If interested or curious library users did not have time to use the system the first time they saw it, they might have used it on subsequent library visits when they had more time.

ASTUTE featured five introductory screens. The first two introductory screens invited library patrons to test the system's ability to find titles on their topics of interest. They also functioned as screen savers to keep the image of either screen from burning into the monitor. When users clicked on the <Continue> button in the first two introductory screens, ASTUTE responded with a series of three additional introductory screens that explained what participation in the test of the system involved. Library users were entirely on their own to read screens, conduct searches, and answer questions.

When users clicked on the <Continue> button in the fifth introductory screen, ASTUTE responded with a pop-up window asking users whether they wanted to search the system for engineering or computer science topics (at UM-Dearborn) or American history topics (at Earlham). A negative answer to this question gave users who did not want to search ASTUTE a graceful way of exiting the system. In response to a user's positive answer to this question, ASTUTE began displaying the pre-search questionnaire and recording all subsequent user actions and system responses to a transaction log.

Administering the pre-search questionnaire.

The pre-search questionnaire contained three questions. It asked users how many times they had used ASTUTE, how often they used computer systems besides ASTUTE, and what their major field of study was. Pre-search and all but one post-search questions were closed-ended. ASTUTE listed response categories following each question and users could use the mouse or arrow keys to highlight and select a category.

When ASTUTE began writing user actions and system responses to a transaction log, it activated a timeout function to determine whether users had terminated their search by walking away from the system. If ASTUTE recorded no user activity for four minutes, the system prompted the user to respond. If it detected no response, the system closed the transaction log for the particular search administration, prepared to open the log for the next search administration, and displayed the first introductory screen.
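The timeout behavior described above can be sketched as a small state machine. This is an illustrative sketch only, not ASTUTE's actual code: the class name SessionTimer, the state labels, and the 30-second grace period after the prompt are all assumptions, since the text states only the four-minute inactivity threshold.

```python
from dataclasses import dataclass
from typing import Optional

PROMPT_AFTER_SECONDS = 4 * 60  # four minutes without activity (from the text)
GRACE_SECONDS = 30             # assumption: the text does not give this value


@dataclass
class SessionTimer:
    """Hypothetical inactivity tracker for one search administration."""
    last_activity: float                 # time of the most recent user action
    prompted_at: Optional[float] = None  # when the "please respond" prompt appeared

    def record_activity(self, now: float) -> None:
        # Any user action restarts the clock and cancels a pending prompt.
        self.last_activity = now
        self.prompted_at = None

    def poll(self, now: float) -> str:
        """Return 'active', 'prompt', 'waiting', or 'reset'."""
        if self.prompted_at is not None:
            # A prompt is on screen; reset once the grace period has elapsed.
            if now - self.prompted_at >= GRACE_SECONDS:
                return "reset"  # close the log, show the first introductory screen
            return "waiting"
        if now - self.last_activity >= PROMPT_AFTER_SECONDS:
            self.prompted_at = now
            return "prompt"
        return "active"
```

A caller would poll the timer periodically and, on "reset", close the transaction log for the current administration and open the log for the next one.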

Searching the experimental online catalog.

The Comparison Search Experiment was designed to be as realistic as possible in that library patrons were searching the experimental online catalog with their own search topics and assessing the usefulness of retrieved items. Furthermore, transaction logging and online administration of pre- and post-search questionnaires made the data collection process as unobtrusive as possible.

The objective of the research project was to test a new subject access design for online catalogs in which search trees controlled the system's selection of subject searching approaches in response to user queries (Drabenstott and Weller 1994). To test this new design, ASTUTE split into two experimental online catalogs following user responses to the third pre-search question: (1) the Blue System, in which search trees controlled the system's selection of subject searching approaches in response to user queries, and (2) the Pinstripe System, which chose subject searching approaches randomly in response to user queries.

Following the third pre-search question, ASTUTE asked users whether their query involved the name of a person. The Blue System only used answers to this question to select between search trees for subjects generally and search trees for personal names as subjects. The Blue or Pinstripe System then asked users to enter their subject queries. As long as participants did not make a move to start a new search or enter a new query, they could search the Blue or Pinstripe System for as long as they wanted. When they made such a move, the Blue System (for odd-numbered participants) switched and automatically conducted a search for the original user-entered query in the Pinstripe System; the Pinstripe System (for even-numbered participants) switched and automatically conducted a search for the original user-entered query in the Blue System. As long as participants did not make a move to start a new search or enter a new query, they could search the second system for as long as they wanted. When they made such a move, ASTUTE initiated the post-search questionnaire. Both Blue and Pinstripe Systems recorded all user activity and system responses to transaction logs.
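The alternation of the first-searched system can be sketched as follows. This is a minimal illustration, assuming participants are numbered sequentially from 1; the function name system_order is hypothetical and does not come from ASTUTE.

```python
def system_order(participant_number: int) -> tuple[str, str]:
    """Return (first system, second system) for a participant.

    Odd-numbered participants start in the Blue System and even-numbered
    participants start in the Pinstripe System, as described in the text.
    Numbering from 1 is an assumption.
    """
    if participant_number < 1:
        raise ValueError("participant numbers are assumed to start at 1")
    if participant_number % 2 == 1:
        return ("Blue", "Pinstripe")
    return ("Pinstripe", "Blue")


# Print the order for the first four participants.
for n in range(1, 5):
    first, second = system_order(n)
    print(n, first, "->", second)
```

Alternating by participant number balances the order of presentation across the two systems, which is the same device Siegel's team used to minimize transfer effects.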

When users displayed bibliographic records in subject searches, both Blue and Pinstripe Systems asked users to assess their usefulness and enumerated three categories for rating usefulness: (1) very useful, (2) possibly useful, and (3) not useful. Both Blue and Pinstripe Systems recorded user assessments of retrieved records to transaction logs.

Administering the post-search questionnaire.

The post-search questionnaire contained eleven questions that asked users to compare and assess the performance of the two systems in view of the useful titles they retrieved. All questions were closed-ended except for question 14. If users responded "very interested" to a question about their interest in seeing ASTUTE's capabilities extended to topics besides the ones in ASTUTE, the system asked users to type in the subjects that interested them. ASTUTE logged user responses to post-search questions to a transaction log.

Involving library staff in the comparison search experiment.

For all but the last two days of five-week and twelve-week test administration periods, ASTUTE was dedicated to patron use in participating libraries. On the last two days, project team members conducted the Comparison Search Experiment with library staff volunteers. Data collection procedures were almost the same as such procedures with end users. Differences were: (1) an interviewer monitored staff searchers, (2) staff conducted a total of six searches for three different topics in both Blue and Pinstripe Systems, and (3) the interviewer supplemented pre- and post-search questions with open-ended questions regarding their searching experiences and their system preferences.

The failure analysis of ASTUTE's subject searching capabilities benefited from the few post-search interviews we conducted with library staff. Although end users gave answers to post-search questions about the system's effectiveness in retrieving useful information, these questions were closed-ended and thus could not probe users for details about their difficulties using particular search functions or their level of understanding with respect to system messages, prompts, or instructions. If we could redesign this study, we would supplement end-user searches, questionnaire responses, and relevance assessments with open-ended interview data. We would conduct interviews with some system users. We would also ask staff at participating libraries to volunteer to conduct these interviews.

Installation Period Problems

Installation problems at participating libraries.

ASTUTE project staff first made ASTUTE available to UM-Dearborn library patrons for a five-week period beginning on November 11, 1993. Mardigian Library staff performed daily maintenance duties, e.g., system start up, shut down, and backup, and ASTUTE project team members made occasional visits to Mardigian Library to make sure that the system was functioning properly and to make backups of transaction log files. Investigating frequent system crashes on a visit toward the end of the five-week installation period, the systems programmer concluded that system and transaction log files had been corrupted when users unexpectedly turned system equipment on and off.

We studied logged searches to determine the extent of the damage and found so many corrupted searches that we doubted the reliability of the collected data. We informed the Mardigian Library staff liaison of the problem, and the liaison invited the project team back to the library for a second installation and data collection period.

Investigating the problem of corrupted system and log files, the ASTUTE project team introduced several changes to ASTUTE to ensure that files would not be corrupted in future system installations at participating libraries. Examples of changes connected with turning the machine on and off were:

We did not experience the same problem at subsequent installation periods at UM-Dearborn and Earlham College. We also instituted a new maintenance policy. In lieu of making periodic visits to the sites to make sure that the system was functioning properly and make backups of transaction log record files, UM-D and Earlham Library staff sent transaction log records to the ASTUTE project team via file transfer protocol (ftp) or campus mail on a daily basis. The ASTUTE project team reviewed the logs as soon as possible to make sure everything was functioning properly.

Identifying usable search administrations.

The second data collection period at the University of Michigan-Dearborn lasted five weeks from March 12 to April 19, 1993. ASTUTE administered a total of 826 Comparison Search Experiments. At Earlham College, data collection lasted thirteen weeks from February 23 to May 28, 1993. ASTUTE administered a total of 238 Comparison Search Experiments. Thirty-three search administrations involved library staff at the two participating libraries.

The project team expected users to enter searches for topics that were not represented in ASTUTE's databases and to leave the experiment without completing the full search administration because interviewers were not present to monitor system use. To determine usable search administrations for submission to data analyses, we had to manually review searches and queries. This procedure was time-consuming and task-intensive. About the only details that could be computerized were the identification of search administrations that users terminated prior to entering their queries and suspect post-search questionnaires which users completed by repeatedly selecting the first response category.
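The two checks that could be computerized can be sketched as below. This is a hypothetical sketch: the Administration record, its field names, and the labels returned by screen are assumptions for illustration and do not reflect ASTUTE's actual transaction log format. Response categories are assumed to be stored as zero-based positions, with 0 standing for the first-listed category.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Administration:
    """Hypothetical summary of one logged search administration."""
    query: Optional[str] = None  # None if the user left before query entry
    post_responses: list[int] = field(default_factory=list)  # 0 = first-listed


def terminated_before_query(adm: Administration) -> bool:
    # A blank query ("") still counts as an entered query; it would be
    # caught later as meaningless input during manual review.
    return adm.query is None


def suspect_post_search(adm: Administration) -> bool:
    # Suspect when every answered question has the first-listed category.
    return bool(adm.post_responses) and all(r == 0 for r in adm.post_responses)


def screen(adm: Administration) -> str:
    if terminated_before_query(adm):
        return "no query entered"
    if suspect_post_search(adm):
        return "suspect post-search questionnaire"
    return "manual review"  # scope and seriousness need human judgment
```

Everything that falls through to "manual review" still requires a person to judge whether the query is in scope and serious, which is why the review remained time-consuming.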

Figure 1 shows the percentages of usable and unusable queries in UM-D and Earlham search administrations.

Figure 1. Usable and unusable administrations

Of the 1,064 search administrations, about half (528) were usable. About a third (34%) of all administrations were unusable administrations in which users entered queries through the experimental online catalog's subject-searching capabilities for subjects generally. About three-quarters of these subject queries were out-of-scope, that is, ASTUTE's bibliographic-record databases did not contain titles for the requested topics, and the remaining quarter were unusable for several other reasons. About one-eighth (12%) of all administrations were unusable administrations in which users entered queries through the experimental online catalog's subject-searching capabilities for personal names. Some of these personal-name queries were out-of-scope, others were elements of known-item searches, and still others were playing or meaningless input. Less than 5% of all administrations were unusable because users completed one or more pre-search questions but did not enter queries. These users probably walked away from the system and it eventually reset itself to the introductory screen savers. Percentages in the four basic categories shown in figure 1 were almost the same at UM-D and Earlham.
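The percentages quoted above can be checked against the reported counts. The sketch below takes shares over all 1,064 administrations and uses the totals of Table 1 (365 and 127 queries) for the two query categories. The number of administrations without a query is not stated in the text; deriving it by subtraction is an assumption.

```python
TOTAL = 1064  # all search administrations (826 at UM-D plus 238 at Earlham)

counts = {
    "usable": 528,
    "unusable, subject capabilities": 365,        # Table 1 total
    "unusable, personal-name capabilities": 127,  # Table 1 total
}
# Assumption: whatever remains is the no-query (walk-away) category.
counts["unusable, no query entered"] = TOTAL - sum(counts.values())

# Share of each category as a percentage of all administrations.
shares = {name: round(100 * n / TOTAL, 1) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name}: {n} ({shares[name]}%)")
```

Under these assumptions the remainder is 44 administrations, which is consistent with the "less than 5%" figure for users who entered no query.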

Subject and personal-name queries in unusable search administrations.

Table 1 summarizes the characteristics of subject and personal-name queries in unusable search administrations. The queries in these unusable administrations were entered through subject searching capabilities or through personal-name subject searching capabilities. The system's choice between the two types of capabilities was based on user responses to a question following the pre-search questionnaire that asked users whether their query involved the name of a person.

Table 1. Subject and Personal-name Queries in Unusable Search Administrations

Query categories                      Capabilities for     Capabilities for
                                      Subjects             Personal Names
                                      No.       %          No.       %

Out of scope
  Subjects                            185      50.7          3       2.3
  Personal names                       15       4.1         27      21.3
  Names and subjects                    0       0.0         38      29.9
  Names, subjects, and call number      0       0.0          1       0.8
Playing
  Subjects                             19       5.2          0       0.0
  Personal names                        3       0.8          2       1.6
  Names and subjects                    2       0.5         10       7.9
  Title                                 2       0.5          0       0.0
Meaningless input
  Blank                                76      20.8          1       0.8
  Character                             1       0.3          0       0.0
  Gibberish                            19       5.2          9       7.1
  Letter                                9       2.5          0       0.0
  Letters                               6       1.7          0       0.0
Known-item search
  Title search                          4       1.1          0       0.0
  Author-title                          0       0.0         28      22.1
  Author                                1       0.3          2       1.6
Other unusable searches
  Sex term                             12       3.3          3       2.3
  Expletive                             2       0.5          0       0.0
  Command                               1       0.3          0       0.0
  Subject entered as a name           n/a       n/a          3       2.3
  System error                          8       2.2          0       0.0

Total                                 365     100.0        127     100.0

Of the 365 queries entered through subject searching capabilities, the largest percentage (50.7%) were for out-of-scope subjects. Examples of such queries entered by UM-D users were "astrophysics," "croatia," "euthanasia," and "activity based costing," and by Earlham users were "illegal abortion," "health conditions south," "koreans in japan," and "manhattan project." There were only three out-of-scope subject queries (2.3% of the 127 queries) entered through personal-name subject searching capabilities. The queries were "woodstock" and "wood stock" (at Earlham) and "medical statistics" (at UM-D).

Users entered out-of-scope personal-name queries through both general subject searching and personal-name subject searching capabilities. Examples of out-of-scope queries for personal names that UM-D users entered using the system's general subject searching capabilities were "assad," "bush, g," "plotkin," and "sinclair upton." Over three-quarters of out-of-scope personal-name queries bearing both name and subject elements came from the test administration at Earlham. Examples of these queries are given below in terms of the elements users entered:

Last name      First name     Topic
tolstoy        peter          history
kennedy        robert         assassination
charlemagne    charlemagne    education
shakespeare    william        actors in shakespeares time
piaget         jean           psychology

Users entered queries in Comparison Search Experiments that were probably not serious queries. We felt users were "playing," and, thus, categorized such queries into "playing" categories. Examples are:

Several types of queries constituted meaningless input: blank lines, character(s), gibberish, and letter(s). UM-D users entered most of the queries categorized here.

Users entered queries through general subject searching capabilities that were probably known-item searches for titles. Three queries entered into ASTUTE at UM-D were similar: "schaum's" (twice) and "schaum's outline series dynamics." At Earlham, a single user entered a query for a known item using general subject searching capabilities - "the age of jackson." Since title words were included in searchable title-keyword and keyword-in-record databases and these queries were relevant to engineering and American history, searches in both Blue and Pinstripe Systems yielded relevant retrievals. However, search administrations bearing these four queries were discarded from subsequent analyses because they were deemed known-item searches.

Three queries were computer science or engineering topics that UM-D users entered into the experimental online catalog through personal-name subject searching capabilities. The queries were "joint application development," "lagrange," and "dell computers." The user entering the query "lagrange" could have been looking for biographical information on the French mathematician Joseph-Louis Lagrange. However, a user entered the query "lagrangian dynamics" five minutes before the "lagrange" query. Thus, the researchers felt that the two users were the same person who wanted information on a topical subject, not on a personal name. The user entering "dell" and "computers" as the last and first name elements, respectively, could have responded positively to the question on persons because a company named Dell Computers sounded like it was named after a person with the last name of "Dell." With regard to these queries, it was understandable why users responded positively to the question on persons that guided the system in its selection of personal-name subject searching capabilities. Some fine tuning to the wording of this question may be necessary to minimize the likelihood of users taking actions that invoke subject searching approaches for personal names when their queries contain proper adjectives or nouns.

Reliability of relevance assessments.

This was not the first study of user queries to discover that users entered terms that were not serious subject queries. Several studies (Drabenstott and Vizine-Goetz 1994; Peters 1989; Henty 1986; Walter 1987; Hunter 1991; Dickson 1984; Lester 1991) characterized such queries as "sex terms," "malicious entries," "gibberish," "garbage," "wrong file," etc. Although it seemed unlikely that such queries would produce retrievals, there were several situations in which the Blue and Pinstripe Systems gave users opportunities to display subject headings and bibliographic records. For example, the Pinstripe System gave users the opportunity to select subject headings and display bibliographic records connected to them because it randomly selected the alphabetical approach in response to user queries regardless of the extent to which they matched listed headings. In personal-name searches, the Blue System responded with the alphabetical approach to name-query elements as a last resort. Keyword searches in both systems were sometimes successful in retrieving records for single letters or sex terms.

The discouraging factor with these retrievals was the preponderance of "very useful" ratings users gave to displayed bibliographic records. Let us illustrate this phenomenon in an example. A user at UM-Dearborn entered the single letter query "j" into the Pinstripe System. The system displayed two titles retrieved in a keyword-in-record search. The user gave one title a "very useful" rating and the other title a "possibly useful" rating even though the two records were on totally different subjects, viz. CAD and jet propulsion systems.

This example and the relevance assessments made by many other users in unusable search administrations made us skeptical about the reliability of relevance assessments and post-search questionnaire responses. We re-examined the method we used to gather such data to determine how it could be changed in future administrations of online questionnaires.

Figure 2 shows the Blue System's display of a title retrieved in a search for "civil rights" bearing a pop-up window that gives users three categories for rating usefulness: (1) "very useful," (2) "possibly useful," and (3) "not useful." The user could click on one of these three categories using the mouse or use a combination of arrow keys and <Enter> key to select a category. The first-listed category (i.e., "very useful") was given in white type and set in a medium-blue color against a light-blue background. The other categories were not highlighted but they were given in light yellow type and set against a medium-blue background. The simplest action for the user to take was to hit the <Enter> key. This made the system choose the first-listed category, i.e., "very useful" and record it to transaction log records. Reviewing relevance assessments for unusable queries, the researchers felt that many users performed this simple action in response to relevance assessment pop-up boxes. This action could have been more prevalent in searches for unusable queries that were meaningless, were for play, sex terms, or expletives, because users were not serious about their searches.

Figure 2. Bibliographic record on "Civil rights"

If the ASTUTE project team could redesign the relevance assessment pop-up window, they would make the selection process a much more deliberate step on the part of the user than hitting the <Enter> key. There are several possibilities and combinations of possibilities for redesigning this window. If the first-listed category were to remain highlighted, it should be changed to a category such as "not useful" or "don't know." Thus, a user's selection of the first-listed category by the simple action of hitting the <Enter> key would not "stack the deck" in favor of "useful" relevance assessments but it would allow for a conservative estimate of useful titles. Another approach would involve requiring users to type in a number or letter corresponding to the category of their choice instead of using the mouse or arrow keys to select listed categories. This would make their selection more deliberate than repeatedly hitting the <Enter> key to select the first-listed response category. The researchers also felt that users would be more conscientious about selecting categories that reflected the relevance of displayed items if they knew that the system would use such assessments to find additional titles. Thus, online catalogs that feature relevance assessment capabilities should inform users that their assessments will be used for finding additional titles to ensure that users give serious consideration to their relevance assessments.
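One of the redesign possibilities, requiring a typed number instead of accepting <Enter> as a default, can be sketched as follows. This is a hypothetical illustration rather than a description of any implemented system: the function names, the prompt wording, and the reader parameter used to supply input are assumptions.

```python
from typing import Callable, Optional

CATEGORIES = {"1": "very useful", "2": "possibly useful", "3": "not useful"}
PROMPT = "Rate this title (1 = very useful, 2 = possibly useful, 3 = not useful): "


def parse_rating(typed: str) -> Optional[str]:
    """Map a typed entry to a category; a bare <Enter> maps to nothing."""
    return CATEGORIES.get(typed.strip())


def ask_rating(read_line: Callable[[str], str] = input) -> str:
    """Repeat the prompt until the user types 1, 2, or 3.

    No category is preselected, so repeatedly hitting <Enter> can never
    record a rating.
    """
    while True:
        rating = parse_rating(read_line(PROMPT))
        if rating is not None:
            return rating
```

Because an empty entry is rejected rather than mapped to the first-listed category, every recorded assessment reflects a deliberate keystroke. A production design would also need a way for users to leave without rating, which this sketch omits.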

Usable Search Administrations

Usable search administrations remained after the researchers discarded unusable administrations, that is, walk-aways and administrations with unusable queries entered through personal-name or subject searching capabilities. Table 2 summarizes the characteristics of these usable administrations.

Table 2. Usable Administrations

Events Completed events
Pre-search questionnaire responses x x x x x
Pinstripe System search x x x x
Blue System search x x x x
Post-search questionnaire responses x
Suspect post-search questionnaire responses x
Number of usable search administrations
(N=528)
71 (13%) 74 (14%) 75 (14%) 80 (15%) 228 (43%)

A large percentage (43%) of usable administrations of the Comparison Search Experiment were full administrations, that is, users completed all four experimental events. Of the four partial-administration categories, the largest percentage (15%) contained the four complete events; however, responses to the post-search questionnaire were suspect. That is, post-search questionnaire responses were default responses that users could have selected by repeatedly pressing the <Enter> key or by repeatedly entering the same number. In the data analysis, the researchers considered these responses suspect, discarded them, and combined the remaining events in these administrations with search administrations in which pre-search questionnaires and searches in Blue and Pinstripe Systems had been completed. The latter administrations accounted for 14% of usable search administrations. When they were combined with the former, the two categories of usable search administrations amounted to almost a third (29%) of usable search administrations.

Reliability of Post-search Questionnaire Responses

The experimental online catalog always concluded search administrations with post-search questionnaires. Although many users whose search administrations were deemed unusable failed to complete post-search questionnaires, about the same number completed them or most of them. Many users who entered out-of-scope queries were serious about their questionnaire responses in that they chose categories that documented the system's inability to produce retrievals for their queries. Other users such as those who were not serious about their queries (e.g., queries in "letter," "sex term," "gibberish" categories) were not serious about their responses to the questionnaire and chose responses that did not make sense in view of their search results. For example, there were searches in which neither system retrieved titles but users chose answers to post-search questions that implied that one or both systems retrieved titles.

Another problem with post-search questionnaire responses was similar to the problem with relevance assessments. In unusable and usable search administrations, users repeatedly selected the first-listed response category in post-search questionnaires. A manual review of searches and relevance assessments made the researchers skeptical about the reliability of these post-search questionnaire responses. We examined the method used to gather such data to determine how it could be changed in future administrations of online questionnaires.

The experimental online catalog displayed each question and response categories separate from other questions and response categories. The first-listed category was given in a bright yellow type and set in a medium-orange color against a dark blue background. The other categories were given in a light yellow type against a dark blue background. The simplest action for the user to take was to hit the <Enter> key. This would make the system choose the first-listed category and save it to transaction log records.

In the analysis of usable search administrations, the researchers disregarded post-search questionnaires in which users selected all first-listed response categories.
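A screening rule of this kind is simple to express in code. The following is a minimal sketch, not the procedure the project actually ran: it assumes that each questionnaire has been reduced to a list of category numbers, that the first-listed category is numbered 1, and that the administration identifiers shown are hypothetical. The second function illustrates the broader pattern of a user repeatedly entering the same number.

```python
def is_suspect(responses, first_listed=1):
    """True when every recorded response is the first-listed category."""
    return len(responses) > 0 and all(r == first_listed for r in responses)


def same_number_throughout(responses):
    """True when a user entered one and the same number for every question."""
    return len(responses) > 1 and len(set(responses)) == 1


# Hypothetical transaction-log extracts: administration id -> responses.
questionnaires = {
    "admin-001": [1, 1, 1, 1, 1],  # all first-listed categories
    "admin-002": [2, 1, 3, 1, 2],  # varied responses
    "admin-003": [3, 3, 3, 3, 3],  # same number, but not the first-listed one
}

# Keep only questionnaires that are not made up entirely of first-listed choices.
kept = {k: v for k, v in questionnaires.items() if not is_suspect(v)}
print(sorted(kept))
```

Whether questionnaires such as the hypothetical "admin-003" should also be set aside is a judgment call for the analyst; the sketch only flags them.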

If the ASTUTE project team could redesign the post-search questionnaire, they would make the selection process a much more deliberate step on the part of the user than just hitting the <Enter> key. There are several possibilities and combinations of possibilities for redesigning the selection process. If the first-listed category were to remain highlighted, it could be changed to a category such as "don't know" or "don't care." First-listed categories could also be different from question to question. When users selected one or more such categories, the truthfulness of their responses to other post-search questions could be in doubt. Another approach would involve requiring users to type in a number or letter corresponding to the category of their choice. This would make their selection more deliberate than repeatedly hitting the <Enter> key.
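One way to picture the "type a number" approach is a prompt that has no default and that rejects a bare <Enter>. The sketch below is hypothetical and is not ASTUTE code: the function names, prompt wording, and the `read` and `write` parameters are assumptions introduced for illustration.

```python
def parse_choice(raw, n_categories):
    """Return the chosen category number, or None if the input is not valid."""
    try:
        choice = int(raw.strip())
    except ValueError:
        return None  # a bare <Enter>, letters, or punctuation are rejected
    return choice if 1 <= choice <= n_categories else None


def ask(question, categories, read=input, write=print):
    """Repeat the question until the user types a valid category number."""
    while True:
        write(question)
        for number, label in enumerate(categories, start=1):
            write(f"  {number}. {label}")
        raw = read("Type the number of your choice: ")
        choice = parse_choice(raw, len(categories))
        if choice is not None:
            return choice
        write("Please type one of the listed numbers.")
```

Because no category is highlighted in advance, repeatedly pressing <Enter> only repeats the question; a response is recorded only after the user types a listed number.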

Conclusions

The purpose of this paper was to describe the comparative approach to system evaluation undertaken in a Department of Education-sponsored research project, the administration of an online retrieval test to produce data to compare the effectiveness of a new subject access design for online catalog searching, efforts to ensure the reliability of collected data, and strategies that researchers can use to improve on the reliability of data in future online retrieval tests that delegate control of retrieval tests to online systems.

The catalog ushered users through a tightly controlled sequence of events and performed and/or recorded all activity connected with data collection including searcher recruitment, questionnaire administration, and logging of users' searches and relevance assessments. In the absence of human intermediaries to monitor ongoing searches, system users were free to complete as little or as much of the test administration as they wanted and enter queries and responses that were unsuitable to the search administration.

In view of our experience delegating data-collection procedures to an experimental online system and the analysis of the data that this system collected, we would introduce the following changes and enhancements to future experiments that enlist systems to perform data-collecting procedures:

Although online questionnaire administration was used in early online catalog use studies, delegating complete control of retrieval tests in a comparative evaluation of online catalogs has not been done before. It is hoped that the analyses and experiences described in this paper will help future researchers plan and execute studies that feature system-controlled administrations of online retrieval tests.

Acknowledgments

The Department of Education supported this research effort (grant number R197D10025) through its College Library Technology and Cooperation Grants. We are grateful for the contributions of many people: Timothy Richards, Robert Kelly, and Barbara Kriigel at the University of Michigan-Dearborn, and Evan Farber and Michael Bowden at Earlham College, who helped us build ASTUTE's databases, monitored the system during lengthy installation periods, and recruited library staff for retrieval tests.

References

Belkin, Nicholas J. et al. (1990). Taking account of users tasks, goals and behavior for the design of online public access catalogs. In ASIS '90; proceedings of the 53rd ASIS annual meeting; 4-8 November, 1990, Toronto, edited by Diane Henderson, 69-79. Medford, N.J.: Learned Information.

Dickson, Jean. (1984). An analysis of user errors in searching an online catalog. Cataloging & Classification Quarterly 4, 3 (Spring): 19-38.

Drabenstott, Karen M., and Diane Vizine-Goetz. (1994). Using subject headings for online retrieval: Theory, practice, and potential. San Diego: Academic Press.

Drabenstott, Karen M., and Marjorie S. Weller. (1995). Testing a new design for subject access to online catalogs. Ann Arbor, Mich.: School of Information and Library Studies.

Drabenstott, Karen M., and Marjorie Weller. Testing a new design for subject searching in online catalogs. Library Hi Tech 12, 1: 67-76.

Hancock-Beaulieu, Micheline, Lorna McKenzie, and Avril Irving. (1991). Evaluative protocols for searching behaviour in online library catalogues. London: British Library. British Library R&D Report 6031.

Henty, Margaret. (1986). The users at the online catalogue: a record of unsuccessful keyword searches. LASIE 17, 2 (September/October): 4-52.

Hildreth, Charles R. (1993). An evaluation of structured navigation for subject searching in online catalogues. Ph.D. dissertation. Department of Information Science, The City University, London.

Hunter, Rhonda N. (1991). Successes and failures of patrons searching the online catalog at a large academic library: a transaction log analysis. RQ 30, 3 (Spring): 395-402.

Jones, Richard M. (1988). A comparative evaluation of two online public access catalogues. London: British Library. British Library Research Paper 39.

Lester, Marilyn Ann. (1989). Coincidence of user vocabulary and Library of Congress Subject Headings: experiments to improve subject access in academic library online catalogs. Ph.D. dissertation, University of Illinois at Urbana-Champaign.

Markey, Karen, and Anh N. Demeyer. (1986). Dewey Decimal Classification online project: Evaluation of a library schedule and index integrated into the subject searching capabilities of an online catalog. Dublin, Ohio: OCLC. OCLC Research Report OCLC/OPR/RR-86/1.

Peters, Thomas A. (1989). When smart people fail: An analysis of the transaction log of an online public access catalog. Journal of Academic Librarianship 15, 5 (November): 267-73.

Siegel, Elliott R. et al. (1983). Research strategy and methods used to conduct a comparative evaluation of two prototype online catalog systems. In National Online Meeting proceedings -1983, 1983 April 12-14, New York, compiled by Martha E. Williams and Thomas H. Hogan, 503-11. Medford, N.J.: Learned Information.

University of California, Division of Library Automation and Library Research and Analysis Group. (1982). Users look at online catalogs: results of a national survey of users and non-users of online public access catalogs: final report to the Council on Library Resources. Berkeley, Calif.: University of California, November 16.

Walker, Stephen, and Richard M. Jones. (1987). Improving subject retrieval in online catalogues; 1. Stemming, automatic spelling correction and cross-reference tables. London: British Library. British Library Research Paper 24.

Walker, Stephen, and Rachel De Vere. (1990). Improving subject retrieval in online catalogues; 2. Relevance feedback and query expansion. London: British Library. British Library Research Paper 72.

Walter, Dennis R. (1987). The user at the online catalogue: A record of unsuccessful keyword searches - another case study. LASIE 18, 3 (November/December): 74-81.


© 1996, American Society for Information Science. Permission to copy and distribute this document is hereby granted provided that this copyright notice is retained on all copies and that copies are not altered.