Marjorie S. Weller
Programmer Analyst
Medical Center Information Technologies
University of Michigan
Ann Arbor, MI 48109-0726
The comparative approach used in this study was innovative and experimental. Although online questionnaire administration was used in an early online catalog use study, delegating complete control of retrieval tests to the experimental online catalog had not been done before.
The purpose of this paper is to describe (1) the comparative approach to system evaluation undertaken in a Department of Education-sponsored research project, (2) the administration of an online retrieval test to produce data for evaluating the effectiveness of a new subject access design for online catalog searching, (3) efforts to ensure the reliability of collected data, and (4) strategies that researchers can use to improve the reliability of data in future online retrieval tests that delegate control of test administration to online systems.
Several researchers have based their evaluation of two or more online catalogs on Siegel's approach. Markey and Demeyer (1986, 109-18) used Siegel's methods to evaluate two experimental online catalogs, one with enhanced subject searching functionality using the Dewey Decimal Classification (DDC) and one without the enhancement. Jones (1988) enlisted the Comparison Search Experiment to compare the performance of the operational LIBERTAS online catalog to the experimental Okapi online catalog. Walker and Jones (1987) enlisted Siegel's approach to evaluate three different Okapi representations that were alike in all ways except for differences in how each representation handled stemming, synonyms and cross referencing, and spelling correction. Researchers evaluated three different Okapi representations that used classification as a recall-improving device and devised a method that "lay somewhere between the Comparison and Sample search experiments used in the NLM experiment" to compare the three systems with regard to effectiveness, efficiency, and user acceptability (Walker and De Vere 1990, 46). Hildreth (1993) employed an approach similar to the 1990 Okapi evaluation to evaluate the usability and retrieval performance of different navigational approaches to subject searching in three different online catalog representations.
In a Department of Education-sponsored study, a research team at Rutgers University recorded searches and responses to online pre- and post-search questionnaires in transaction logs for the purpose of discovering the goals, tasks, and behaviors of library and online catalog users (Belkin et al. 1990).
A research team at City University collected data on online catalog use using methods compatible with those enlisted by the Rutgers team (Hancock-Beaulieu, McKenzie, and Irving 1990). The team developed Olive, a front-end microcomputer program that recorded searches and pre-, in-, and post-search online questionnaire responses; Olive also featured search playback and printing capabilities.
While the earliest comparative studies discussed in the previous section enlisted manual methods to collect data, later studies (Walker and Jones 1987; Walker and De Vere 1990, 46; Hildreth 1993) involved some test administration by the experimental system in the form of logged searches and/or user relevance assessments for retrieved titles. The comparative study described in this paper delegated complete control of test administrations - from the recruitment of end users to logging post-search questionnaire responses - to the experimental system.
The two separate databases of the ASTUTE experimental online catalog were created from two data sources: (1) machine-readable cataloging (MARC) records for bibliographic data from the two participating libraries in selected subject areas of the Library of Congress Classification (LCC), and (2) MARC records for subject-authority data from the compact disk-based product CD/MARC Subjects distributed by the Library of Congress. The number and subject areas of MARC bibliographic records were:
ASTUTE featured five introductory screens. The first two introductory screens invited library patrons to test the system's ability to find titles on their topics of interest. They also functioned as screen savers to avoid the image of one of the screens burning into the monitor. When users clicked on the <Continue> button in the first two introductory screens, ASTUTE responded with a series of three additional introductory screens that explained their participation in the test of the system. Library users were entirely on their own to read screens, conduct searches, and answer questions.
When users clicked on the <Continue> button in the fifth introductory screen, ASTUTE responded with a pop-up window asking users whether they wanted to search the system for engineering or computer science topics (at UM-Dearborn) or American history topics (at Earlham). A negative answer to this question gave users who did not want to search ASTUTE a graceful way of exiting the system. In response to a user's positive answer to this question, ASTUTE began displaying the pre-search questionnaire and recording all subsequent user actions and system responses to a transaction log.
When ASTUTE began writing user actions and system responses to a transaction log, it activated a timeout function to determine whether users had terminated their search by walking away from the system. If ASTUTE recorded no user activity for four minutes, the system prompted the user to respond. If it detected no response, the system closed the transaction log for the particular search administration, prepared to open the log for the next search administration, and displayed the first introductory screen.
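The timeout behavior described above can be sketched as a small state tracker. This is a minimal illustration in Python, not ASTUTE's actual implementation: the class name, the state labels, and the 30-second grace period after the prompt are all assumptions, since the text specifies only the four-minute idle limit.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_LIMIT_SECONDS = 4 * 60   # four minutes of inactivity, as stated in the text
PROMPT_GRACE_SECONDS = 30     # assumption: the text gives no grace period


@dataclass
class SessionWatchdog:
    """Tracks idle time for one search administration (hypothetical helper)."""
    last_activity: float
    prompted_at: Optional[float] = None

    def record_activity(self, now: float) -> None:
        # Any user action resets the idle clock and cancels a pending prompt.
        self.last_activity = now
        self.prompted_at = None

    def poll(self, now: float) -> str:
        """Return 'active', 'prompt', 'waiting', or 'close'."""
        if self.prompted_at is not None:
            # The user has been prompted; close the log if there is no response.
            if now - self.prompted_at >= PROMPT_GRACE_SECONDS:
                return "close"
            return "waiting"
        if now - self.last_activity >= IDLE_LIMIT_SECONDS:
            self.prompted_at = now
            return "prompt"
        return "active"
```

On "close", a system following the description would close the current transaction log, prepare the next one, and return to the first introductory screen.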
The objective of the research project was to test a new subject access design to online catalogs in which search trees controlled the system's selection of subject searching approaches in response to user queries (Drabenstott and Weller 1994). To test this new design, ASTUTE split into two experimental online catalogs following user responses to the third pre-search question: (1) the Blue System, in which search trees controlled the system's selection of subject searching approaches in response to user queries, and (2) the Pinstripe System, which chose subject searching approaches randomly in response to user queries.
Following the third pre-search question, ASTUTE asked users whether their query involved the name of a person. The Blue System only used answers to this question to select between search trees for subjects generally and search trees for personal names as subjects. The Blue or Pinstripe System then asked users to enter their subject queries. As long as participants did not make a move to start a new search or enter a new query, they could search the Blue or Pinstripe System for as long as they wanted. When they made such a move, the Blue System (for odd-numbered participants) switched and automatically conducted a search for the original user-entered query in the Pinstripe System; the Pinstripe System (for even-numbered participants) switched and automatically conducted a search for the original user-entered query in the Blue System. As long as participants did not make a move to start a new search or enter a new query, they could search the second system for as long as they wanted. When they made such a move, ASTUTE initiated the post-search questionnaire. Both Blue and Pinstripe Systems recorded all user activity and system responses to transaction logs.
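The alternation between systems can be sketched in a few lines. This is an illustrative Python sketch only: the function names and event labels are hypothetical, and it assumes participants were numbered sequentially starting at 1, which the text does not state.

```python
from typing import List, Tuple


def system_order(participant_number: int) -> Tuple[str, str]:
    """Return (first system, second system) for a participant.

    Odd-numbered participants start in the Blue System and even-numbered
    participants start in the Pinstripe System, as described in the text.
    """
    if participant_number < 1:
        raise ValueError("participant numbers are assumed to start at 1")
    if participant_number % 2 == 1:
        return ("Blue", "Pinstripe")
    return ("Pinstripe", "Blue")


def administration_events(participant_number: int) -> List[str]:
    """List the events of one full search administration, in order."""
    first, second = system_order(participant_number)
    return [
        "pre-search questionnaire",
        "person-name question",
        "query entry",
        f"{first} System search",
        f"{second} System search (original query rerun automatically)",
        "post-search questionnaire",
    ]
```

Alternating the starting system this way spreads any order effect roughly evenly across the two systems.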
When users displayed bibliographic records in subject searches, both Blue and Pinstripe Systems asked users to assess their usefulness and enumerated three categories for rating usefulness: (1) very useful, (2) possibly useful, and (3) not useful. Both Blue and Pinstripe Systems recorded user assessments of retrieved records to transaction logs.
For all but the last two days of the five-week and twelve-week test administration periods, ASTUTE was dedicated to patron use in participating libraries. On the last two days, project team members conducted the Comparison Search Experiment with library staff volunteers. Data collection procedures were almost the same as those used with end users. Differences were: (1) an interviewer monitored staff searchers, (2) staff conducted a total of six searches for three different topics in both Blue and Pinstripe Systems, and (3) the interviewer supplemented pre- and post-search questions with open-ended questions regarding their searching experiences and their system preferences.
The failure analysis of ASTUTE's subject searching capabilities benefited from the few post-search interviews we conducted with library staff. Although end users gave answers to post-search questions about the system's effectiveness in retrieving useful information, these questions were closed ended, and, thus, could not probe users for details about their difficulties using particular search functions or their level of understanding with respect to system messages, prompts, or instructions. If we could redesign this study, we would supplement end-user searches, questionnaire responses, and relevance assessments with open-ended interview data. We would conduct interviews with some system users. We would also ask staff at participating libraries to volunteer to conduct these interviews.
ASTUTE project staff first made ASTUTE available to UM-Dearborn library patrons for a five-week period beginning on November 11, 1993. Mardigian Library staff performed daily maintenance duties, e.g., system start up, shut down, and backup, and ASTUTE project team members made occasional visits to Mardigian Library to make sure that the system was functioning properly and to make backups of transaction log files. Investigating frequent system crashes on a visit toward the end of the five-week installation period, the systems programmer concluded that system and transaction log files had been corrupted when users unexpectedly turned system equipment on and off.
We studied logged searches to determine the extent of the damage and found so many corrupted searches that we doubted the reliability of collected data. We informed the Mardigian Library staff liaison of the problem, and the liaison invited the project team back to the library for a second installation and data collection period.
Investigating the problem of corrupted system and log files, the ASTUTE project team introduced several changes to ASTUTE to ensure that files would not be corrupted in future system installations at participating libraries. Examples of changes connected with turning the machine on and off were:
The project team expected users to enter searches for topics that were not represented in ASTUTE's databases and to leave the experiment without completing the full search administration because interviewers were not present to monitor system use. To determine which search administrations were usable for data analysis, we had to review searches and queries manually. This procedure was time-consuming and labor-intensive. About the only steps that could be computerized were identifying search administrations that users terminated before entering their queries and flagging suspect post-search questionnaires, which users completed by repeatedly selecting the first response category.
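The two automated checks mentioned above could look something like the following Python sketch. The record layout is an assumption made for illustration (the paper does not describe the transaction log format), and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Administration:
    """One logged search administration (hypothetical record layout)."""
    # None means the user left before reaching the query prompt.
    query: Optional[str]
    # 1-based position of the category chosen for each post-search question.
    post_search_choices: List[int] = field(default_factory=list)


def terminated_before_query(admin: Administration) -> bool:
    """True when the user left before entering any query."""
    return admin.query is None


def suspect_post_search(admin: Administration) -> bool:
    """True when every post-search answer is the first-listed category."""
    choices = admin.post_search_choices
    return bool(choices) and all(choice == 1 for choice in choices)


def needs_manual_review(admin: Administration) -> bool:
    """Everything that reached query entry still needs a human reviewer."""
    return not terminated_before_query(admin)
```

Judging whether a query is out of scope, playful, or a known-item search still requires manual review; these checks only remove the cases a program can recognize reliably.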
Figure 1 shows the percentages of usable and unusable queries in UM-D and Earlham search administrations.
| Query categories | Capabilities for Subjects: No. | Capabilities for Subjects: % | Capabilities for Personal Names: No. | Capabilities for Personal Names: % |
|---|---|---|---|---|
| Out of scope | | | | |
| Subjects | 185 | 50.7 | 3 | 2.3 |
| Personal names | 15 | 4.1 | 27 | 21.3 |
| Names and subjects | 0 | 0.0 | 38 | 29.9 |
| Names, subjects, and call number | 0 | 0.0 | 1 | 0.8 |
| Playing | | | | |
| Subjects | 19 | 5.2 | 0 | 0.0 |
| Personal names | 3 | 0.8 | 2 | 1.6 |
| Names and subjects | 2 | 0.5 | 10 | 7.9 |
| Title | 2 | 0.5 | 0 | 0.0 |
| Meaningless input | | | | |
| Blank | 76 | 20.8 | 1 | 0.8 |
| Character | 1 | 0.3 | 0 | 0.0 |
| Gibberish | 19 | 5.2 | 9 | 7.1 |
| Letter | 9 | 2.5 | 0 | 0.0 |
| Letters | 6 | 1.7 | 0 | 0.0 |
| Known-item search | | | | |
| Title search | 4 | 1.1 | 0 | 0.0 |
| Author-title | 0 | 0.0 | 28 | 22.1 |
| Author | 1 | 0.3 | 2 | 1.6 |
| Other unusable searches | | | | |
| Sex term | 12 | 3.3 | 3 | 2.3 |
| Expletive | 2 | 0.5 | 0 | 0.0 |
| Command | 1 | 0.3 | 0 | 0.0 |
| Subject entered as a name | n/a | n/a | 3 | 2.3 |
| System error | 8 | 2.2 | 0 | 0.0 |
| Total | 365 | 100.0 | 127 | 100.0 |
Of the 365 queries entered through subject searching capabilities, the largest percentage (50.7%) were for out-of-scope subjects. Examples of such queries entered by UM-D users were "astrophysics," "croatia," "euthanasia," and "activity based costing," and by Earlham users were "illegal abortion," "health conditions south," "koreans in japan," and "manhattan project." There were only three out-of-scope subject queries (2.3%) entered through personal-name subject searching capabilities. The queries were "woodstock" and "wood stock" (at Earlham) and "medical statistics" (at UM-D).
Users entered out-of-scope personal-name queries through both general subject searching and personal-name subject searching capabilities. Examples of out-of-scope queries for personal names that UM-D users entered using the system's general subject searching capabilities were "assad," "bush, g," "plotkin," and "sinclair upton." Over three-quarters of out-of-scope personal-name queries bearing both name and subject elements came from the test administration at Earlham. Examples of these queries are given below in terms of the elements users entered:
| Last name | First name | Topic |
|---|---|---|
| tolstoy | peter | history |
| kennedy | robert | assassination |
| charlemagne | charlemagne | education |
| shakespeare | william | actors in shakespeares time |
| piaget | jean | psychology |
Users entered queries through general subject searching capabilities that were probably known-item searches for titles. Three queries entered into ASTUTE at UM-D were similar: "schaum's" (twice) and "schaum's outline series dynamics." At Earlham, a single user entered a query for a known item using general subject searching capabilities - "the age of jackson." Since title words were included in searchable title-keyword and keyword-in-record databases and these queries were relevant to engineering and American history, searches in both Blue and Pinstripe Systems yielded relevant retrievals. However, search administrations bearing these four queries were discarded from subsequent analyses because they were deemed known-item searches.
Three queries were computer science or engineering topics that UM-D users entered into the experimental online catalog through personal-name subject searching capabilities. The queries were "joint application development," "lagrange," and "dell computers." The user entering the query "lagrange" could have been looking for biographical information on the French mathematician Joseph-Louis Lagrange. However, a user entered the query "lagrangian dynamics" five minutes before the "lagrange" query. Thus, the researchers felt that the two users were the same person, who wanted information on a topical subject and not on a personal name. The user entering "dell" and "computers" as the last-name and first-name elements, respectively, could have responded positively to the question on persons that guided the system in its selection of personal-name subject searching capabilities because a company named Dell Computers sounded like it was named after a person with the last name of "Dell." With regard to these queries, it is understandable why users responded positively to the question on persons. Some fine tuning to the wording of this question may be necessary to minimize the likelihood of users taking actions that invoke subject searching approaches for personal names when their queries contain proper adjectives or nouns.
The discouraging factor with these retrievals was the preponderance of "very useful" ratings users gave to displayed bibliographic records. Let us illustrate this phenomenon in an example. A user at UM-Dearborn entered the single letter query "j" into the Pinstripe System. The system displayed two titles retrieved in a keyword-in-record search. The user gave one title a "very useful" rating and the other title a "possibly useful" rating even though the two records were on totally different subjects, viz. CAD and jet propulsion systems.
This example and the relevance assessments made by many other users in unusable search administrations made us skeptical about the reliability of relevance assessments and post-search questionnaire responses. We re-examined the method we used to gather such data to determine how it could be changed in future administrations of online questionnaires.
Figure 2 shows the Blue System's display of a title retrieved in a search for "civil rights" bearing a pop-up window that gives users three categories for rating usefulness: (1) "very useful," (2) "possibly useful," and (3) "not useful." The user could click on one of these three categories using the mouse or use a combination of arrow keys and the <Enter> key to select a category. The first-listed category (i.e., "very useful") was given in white type and set in a medium-blue color against a light-blue background. The other categories were not highlighted, but they were given in light yellow type and set against a medium-blue background. The simplest action for the user to take was to hit the <Enter> key. This made the system choose the first-listed category, i.e., "very useful," and record it to transaction log records. Reviewing relevance assessments for unusable queries, the researchers felt that many users performed this simple action in response to relevance assessment pop-up boxes. This action could have been more prevalent in searches for unusable queries (meaningless input, playing, sex terms, or expletives) because users were not serious about their searches.
| Events | Completed events | ||||
|---|---|---|---|---|---|
| Pre-search questionnaire responses | x | x | x | x | x |
| Pinstripe System search | x | x | x | x | |
| Blue System search | x | x | x | x | |
| Post-search questionnaire responses | x | ||||
| Suspect post-search questionnaire responses | x | ||||
| Number of usable search administrations (N=528) | 71 (13%) | 74 (14%) | 75 (14%) | 80 (15%) | 228 (43%) |
A large percentage (43%) of usable administrations of the Comparison Search Experiment were full administrations, that is, users completed all four experimental events. Of the four partial-administration categories, the largest percentage (15%) contained the four complete events; however, responses to the post-search questionnaire were suspect. That is, post-search questionnaire responses were default responses that users could have selected by repeatedly pressing the <Enter> key or by repeatedly entering the same number. In the data analysis, the researchers considered these responses suspect, discarded them, and combined the remaining events in these administrations with search administrations in which pre-search questionnaires and searches in Blue and Pinstripe Systems had been completed. The latter administrations accounted for 14% of usable search administrations. When they were combined with the former, the two categories of usable search administrations amounted to almost a third (29%) of usable search administrations.
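The percentages reported above can be checked against the counts in the table's last row. The short Python sketch below does only that arithmetic; the variable names are illustrative. Because the table lists two categories at 14% (74 and 75 administrations), the combined figure is computed for both.

```python
# Counts of usable search administrations, copied from the table's last row.
counts = [71, 74, 75, 80, 228]
total = sum(counts)

# Share of each category, rounded to a whole percentage.
shares = [round(100 * count / total) for count in counts]

# The suspect-questionnaire category (80) combined with either 14% category.
suspect = 80
combined = [round(100 * (suspect + other) / total) for other in (74, 75)]

print(total)     # 528
print(shares)    # [13, 14, 14, 15, 43]
print(combined)  # [29, 29]
```

Either pairing gives the 29% ("almost a third") reported in the text.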
Another problem with post-search questionnaire responses was similar to the problem with relevance assessments. In unusable and usable search administrations, users repeatedly selected the first-listed response category in post-search questionnaires. A manual review of searches and relevance assessments made the researchers skeptical about the reliability of these post-search questionnaire responses. We examined the method used to gather such data to determine how it could be changed in future administrations of online questionnaires.
The experimental online catalog displayed each question and response categories separate from other questions and response categories. The first-listed category was given in a bright yellow type and set in a medium-orange color against a dark blue background. The other categories were given in a light yellow type against a dark blue background. The simplest action for the user to take was to hit the <Enter> key. This would make the system choose the first-listed category and save it to transaction log records.
In the analysis of usable search administrations, the researchers disregarded post-search questionnaires in which users selected all first-listed response categories.
If the ASTUTE project team could redesign the post-search questionnaire, they would make the selection process a much more deliberate step on the part of the user than just hitting the <Enter> key. There are several possibilities and combinations of possibilities for redesigning the selection process. If the first-listed category were to remain highlighted, it could be changed to a category such as "don't know" or "don't care." First-listed categories could also be different from question to question. When users selected one or more such categories, the truthfulness of their responses to other post-search questions could be in doubt. Another approach would involve requiring users to type in a number or letter corresponding to the category of their choice. This would make their selection more deliberate than repeatedly hitting the <Enter> key.
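The last suggestion, requiring a typed number, can be illustrated with a short Python sketch. It is one possible design under stated assumptions, not the project's implementation: the function names and prompt wording are hypothetical, and the input and output functions are passed in only so the logic can be exercised without a terminal.

```python
from typing import Callable, List, Optional


def parse_choice(raw: str, n_categories: int) -> Optional[int]:
    """Return the chosen category number, or None if the input is not valid.

    A bare <Enter> (empty input) is rejected, so no category is ever
    selected by default.
    """
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        return None
    choice = int(text)
    return choice if 1 <= choice <= n_categories else None


def ask(question: str, categories: List[str],
        read: Callable[[str], str] = input,
        write: Callable[[str], None] = print) -> int:
    """Show a question and repeat the prompt until a valid number is typed."""
    write(question)
    for number, label in enumerate(categories, start=1):
        write(f"  {number}. {label}")
    while True:
        choice = parse_choice(read("Type the number of your answer: "),
                              len(categories))
        if choice is not None:
            return choice
        write("Please type one of the listed numbers.")
```

A user can still type the same number for every question, so a check for uniform responses across a questionnaire would remain useful alongside this design.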
The catalog ushered users through a tightly controlled sequence of events and performed and/or recorded all activity connected with data collection including searcher recruitment, questionnaire administration, and logging of users' searches and relevance assessments. In the absence of human intermediaries to monitor ongoing searches, system users were free to complete as little or as much of the test administration as they wanted and enter queries and responses that were unsuitable to the search administration.
In view of our experience delegating data-collection procedures to an experimental online system and the analysis of the data that this system collected, we would introduce the following changes and enhancements to future experiments that enlist systems to perform data-collecting procedures:
Dickson, Jean. (1984). An analysis of user errors in searching an online catalog. Cataloging & Classification Quarterly 4, 3 (Spring): 19-38.
Drabenstott, Karen M., and Diane Vizine-Goetz. (1994). Using subject headings for online retrieval: Theory, practice, and potential. San Diego: Academic Press.
Drabenstott, Karen M., and Marjorie S. Weller. (1995). Testing a new design for subject access to online catalogs. Ann Arbor, Mich.: School of Information and Library Studies.
Drabenstott, Karen M., and Marjorie S. Weller. (1994). Testing a new design for subject searching in online catalogs. Library Hi Tech 12, 1: 67-76.
Hancock-Beaulieu, Micheline, Lorna McKenzie, and Avril Irving. (1991). Evaluative protocols for searching behaviour in online library catalogues. London: British Library. British Library R&D Report 6031.
Henty, Margaret. (1986). The users at the online catalogue: a record of unsuccessful keyword searches. LASIE 17, 2 (September/October): 4-52.
Hildreth, Charles R. (1993). An evaluation of structured navigation for subject searching in online catalogues. Ph.D. dissertation. Department of Information Science, The City University, London.
Hunter, Rhonda N. (1991). Successes and failures of patrons searching the online catalog at a large academic library: a transaction log analysis. RQ 30, 3 (Spring): 395-402.
Jones, Richard M. (1988). A comparative evaluation of two online public access catalogues. London: British Library. British Library Research Paper 39.
Lester, Marilyn Ann. (1989). Coincidence of user vocabulary and Library of Congress Subject Headings: experiments to improve subject access in academic library online catalogs. Ph.D. dissertation, University of Illinois at Urbana-Champaign.
Markey, Karen, and Anh N. Demeyer. (1986). Dewey Decimal Classification online project: Evaluation of a library schedule and index integrated into the subject searching capabilities of an online catalog. Dublin, Ohio: OCLC. OCLC Research Report OCLC/OPR/RR-86/1.
Peters, Thomas A. (1989). When smart people fail: An analysis of the transaction log of an online public access catalog. Journal of Academic Librarianship 15, 5 (November): 267-73.
Siegel, Elliott R. et al. (1983). Research strategy and methods used to conduct a comparative evaluation of two prototype online catalog systems. In National Online Meeting proceedings -1983, 1983 April 12-14, New York, compiled by Martha E. Williams and Thomas H. Hogan, 503-11. Medford, N.J.: Learned Information.
University of California, Division of Library Automation and Library Research and Analysis Group. (1982). Users look at online catalogs: results of a national survey of users and non-users of online public access catalogs: final report to the Council on Library Resources. Berkeley, Calif.: University of California, November 16.
Walker, Stephen, and Richard M. Jones. (1987). Improving subject retrieval in online catalogues; 1. Stemming, automatic spelling correction and cross-reference tables. London: British Library. British Library Research Paper 24.
Walker, Stephen, and Rachel De Vere. (1990). Improving subject retrieval in online catalogues; 2. Relevance feedback and query expansion. London: British Library. British Library Research Paper 72.
Walter, Dennis R. (1987). The user at the online catalogue: A record of unsuccessful keyword searches - another case study. LASIE 18, 3 (November/December): 74-81.
© 1996, American Society for Information Science. Permission to copy and distribute this document is hereby granted provided that this copyright notice is retained on all copies and that copies are not altered.