The goal of this long-term project is to develop a relatively simple measurement scale, based on user criteria, that will yield results applicable to the study of user evaluations in any type of information seeking and use environment. The project consists of a series of validation tests in which a set of user criterion terms is progressively clarified and consolidated into smaller sets of terms through conceptual analyses of responses from subsequent groups of different users. This paper describes the first two tests, which were conducted to determine how users interpret criterion terms drawn from previous user-based relevance studies. These exploratory tests revealed conceptual and methodological challenges involving the semantic ambiguity of terms and verified the importance of situational context in scale development and use. Recognizing and dealing with these challenges has moved the project forward considerably and will affect the final design of the scale.
In the past decade, a number of researchers collectively have elicited more than 100 criteria directly from end-users describing their perceptions of information sources in the context of their own information problem situations. Even casual examination of their findings reveals conceptual overlaps among the majority of user criteria, with concepts such as "aboutness" or subject similarity, accuracy, and availability occurring in every study. These overlaps confirm a longstanding belief in the existence, at some broad level, of a finite range of criteria--perhaps a dozen or fewer--that are applied by users regardless of type of situation (see Schamber, 1994). A measurement scale based on common user criteria potentially would yield far more useful data than the simple judgments of relevant/not relevant traditionally employed in most IR evaluation research. The data could be applied, for example, in pinpointing areas for improvement in content and presentation of IR output.
Criteria from three previous studies were chosen for initial validation testing. In all three studies, users' descriptions of their information problem situations and decisions were elicited through some form of open-ended questioning and their responses analyzed in order to identify criteria.
Schamber (1991) interviewed professional users of weather information employed in aviation, electric power utilities, and construction. She asked them to evaluate their information sources, which included weather information systems, mass media, themselves (personal experience), and other people. Analysis of their responses yielded 10 criteria at a broad summary level for evaluating information, source, and presentation, and 21 criteria at a narrower detail level (see also Barry & Schamber, 1995).
The other two studies involved faculty and students in several disciplines who had requested searches for text-based information in an academic library setting. Su (1993) conducted a post-search interview in which she asked users to explain their ratings of the overall success of the search. Her analysis of their explanations yielded 26 success dimensions or criteria. In addition to criteria for retrieval output, these included criteria for quality of IR interaction.
Barry (1994) conducted a post-search interview in which she asked users to evaluate retrieved items individually and explain their evaluations. Her analysis identified 23 criteria in seven categories: information content of documents, the user's previous knowledge, the user's beliefs and preferences, other information and sources within the information environment, sources of documents, document as physical entity, and the user's situation.
These three studies seemed to offer the most comparable and thoroughly defined criteria elicited directly from a wide variety of users in different types of information problem situations. The set of criteria selected for the first tests was not intended to be comprehensive, but rather to provide reasonable initial coverage of criteria across information settings. In every test, respondents will be encouraged to add criteria of their own.
Judgments based on user criteria, however, raise fascinating questions about the relative importance of various criteria, tradeoffs in criterion importance, interrelationships among criteria, relationships between criteria and tangible document and system features, and relationships between criteria and users' information problem situations and stages of information seeking. These questions strongly suggest the need for a valid and reliable criterion-based instrument for collecting evaluation data in any type of information setting. The potential for using relevance criteria in the development of such an instrument was demonstrated by Wang (1994), who developed a flexible model of the document selection process based in part on 11 user criteria drawn from the studies by Barry, Schamber, and others. In Wang's model, final document selection is based on decision rules and tradeoff principles such as elimination (finding reasons to reject) and multicriteria (acceptance based on more than one criterion).
Ideally, a truly user-based instrument will be derived from users' own criteria and validated throughout its development by subsequent groups of different users. This process will not be easy. Researchers in the original user criteria studies were considerably challenged by the semantic ambiguities of natural language when interpreting, categorizing, and coding users' responses for analysis. It is difficult to predict how subsequent users will interpret and organize the criteria, especially outside the context of the original problem situations, without many rounds of testing.
On the other hand, quantifying user judgments is not difficult. The benefits of ranking, category rating scales, and continuous scales for assessing relevance have been promoted since the earliest relevance research in the 1960s (see Schamber, 1994). A particularly useful continuous-scale measurement technique is magnitude estimation, which is said to be more reliable than category scaling and easier to administer and analyze (see Eisenberg, 1988; Rorvig, 1988). The potential of magnitude estimation for instrument development was demonstrated by Janes (1991) in his motion index, used to record rapid changes in relevance judgments over time. Janes employed a magnitude estimation technique in which respondents made a mark on a 100mm line indicating their judgments of the degree of relevance as progressively more elements of citations were shown to the respondent. It is reasonable to suppose that judgments based on relevance criteria can be measured in the same way.
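As a minimal sketch of how marks on a 100mm line might be turned into numeric judgments, the following assumes that each mark is recorded as a distance in millimeters from the left end of the line. The function name, the 0-to-1 scaling, and the sample marks are hypothetical illustrations and are not taken from Janes (1991).

```python
LINE_LENGTH_MM = 100.0  # length of the response line described in the text


def mark_to_score(mark_mm, line_length_mm=LINE_LENGTH_MM):
    """Convert a mark position (mm from the left end) to a score from 0 to 1."""
    if not 0 <= mark_mm <= line_length_mm:
        raise ValueError("mark must fall on the line")
    return mark_mm / line_length_mm


# Hypothetical marks from one respondent as more citation elements are shown.
marks_mm = [20.0, 35.0, 80.0]
scores = [mark_to_score(m) for m in marks_mm]

# Change in judgment between successive presentations.
changes = [later - earlier for earlier, later in zip(scores, scores[1:])]
```

Recording the change between successive marks is one simple way to represent how a judgment moves as more of a citation is revealed; other summaries are equally possible.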
This paper presents the results of the first two tests in the long-term project. The goal of the project is to develop a user-validated measurement scale based on user criteria for relevance evaluation. It is expected that the final scale will:
The test consisted of three parts: (1) a sorting task in which respondents sorted the cards by placing cards that represented the same concept into the same pile, with no limit on the number of piles; (2) a naming task, in which they labeled each card pile with the term that best described the concept represented by the pile as a whole; and (3) an interview in which they were asked to explain how they had sorted the criteria, why they had put particular criteria together, and what concept each pile represented. Subsequent questions probed for responses regarding the criteria most important to the respondent, those easiest to sort and most difficult to sort, and any additional comments about the difficulty of the tasks. Interview responses were recorded on paper and on audiotape. It was emphasized that there were no time limits on the tasks, and no right or wrong ways to group and label the cards or answer the questions.
When students submitted their topic proposals, they were asked to perform five tasks: (1) relevance judgment, in which they chose the one citation from their topic proposal they expected to be the best source for their paper; (2) explanation, in which they stated why they selected this as the best source, where they found it, and whether they had actually read it yet; (3) evaluation, in which they selected cards with criterion terms that applied to this source; (4) sorting, in which they grouped the selected cards in piles with others that seemed to represent the same concept (putting the card with the most appropriate term on top); and (5) ranking, in which they numbered the piles in order of the importance of the criteria in determining choice of the citation. Data consisted of written responses, including open-ended explanations, and numbered card piles.
After the paper was completed, students in one class section were asked to select the best source from the citation list in their papers. This source may or may not have been the same one they had selected from their topic proposal. On a two-page printout of the 83 criteria, they selected, grouped, and ranked the criteria that applied to this source. Eleven respondents completed the post-assignment tasks; seven completed both pre- and post-assignment tasks. Data consisted of the marked-up criterion pages and written responses.
In Test 2, respondents were generally consistent in grouping certain terms as synonyms. Table 3 shows the five major groups, including the frequency with which all terms appeared in a given group and the frequency with which some terms appeared in a given group. No one term in these groups was consistently selected as the most appropriate term. The meanings of the terms tended to represent characteristics such as currency that could be determined easily by users. Beyond these major groups, a few terms were handled in unexpected ways. For example, several respondents used both in-depth and overview to describe the same source instead of interpreting them as mutually exclusive concepts. Other responses, such as the grouping of interpretive and descriptive as synonyms, apparently reflected unfamiliarity with these distinct types of research approaches as mentioned by users in the original studies. One respondent selected applied and added "to my topic" to the card.
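The two frequencies reported for each group in Table 3 can be illustrated with a short sketch. It rests on assumptions that the paper does not state: a respondent is counted under "all terms" when a single pile contains every term of the group, and under "some terms" when a single pile contains at least two of the group's terms (so the "some" count includes the "all" count). The function name and the sample piles are hypothetical.

```python
def count_group_usage(respondent_piles, group_terms, min_terms=2):
    """Return (all_count, some_count) for one concept group.

    respondent_piles: one entry per respondent, each a list of piles,
    where a pile is a list of criterion terms.
    """
    group = set(group_terms)
    all_count = 0
    some_count = 0
    for piles in respondent_piles:
        # Largest overlap between the group and any single pile.
        best = max((len(group & set(pile)) for pile in piles), default=0)
        if best == len(group):
            all_count += 1
        if best >= min_terms:
            some_count += 1  # assumed to include the "all" cases
    return all_count, some_count


# Hypothetical piles from three respondents (not data from the study).
currency_terms = ["current", "recent", "up-to-date"]
sample_piles = [
    [["current", "recent", "up-to-date"], ["clear"]],   # all terms together
    [["current", "recent"], ["up-to-date", "clear"]],   # some terms together
    [["current"], ["recent"], ["up-to-date"]],          # terms kept apart
]
result = count_group_usage(sample_piles, currency_terms)
```

With these sample piles the first respondent counts toward both frequencies, the second only toward "some terms," and the third toward neither.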
Despite instructions to place positive and negative forms of the same concept (e.g., familiar and unfamiliar) in the same concept piles, several Test 1 respondents placed them in separate piles. Some even searched for antonyms, as did the respondent who said, "I don't have a term to match lightweight, I don't have heavy." Also, because the criteria were presented as phrases (it is clear, he/she is well known, I like it), some respondents created "it," "he/she," and "I" piles and one had "1st party" and "3rd party" piles.
Test 2 respondents were far more consistent in their approaches to conceptual sorting. Table 3 shows the major groups, pertaining generally to concepts of aboutness, currency, availability, clarity, and credibility. A majority of the respondents selected at least some terms in the top two groups, with 71% choosing aboutness concepts and 64% choosing currency concepts.
Despite differences in conceptual approaches to the sorting task in Tests 1 and 2, the number of term card piles remained roughly the same. In Test 1, with 119 terms, the number of piles ranged from 3 to 17, with an average of 10 per respondent. In Test 2, with 83 terms, the number of piles ranged from 1 to 20, with an average of 7 per respondent.
The number of respondents who selected certain criteria (Table 2) ranged from 28 (100%) for about my topic to 0 for 10 criteria. A dozen criteria were selected by 14 (50%) or more respondents. The 10 criteria not selected at all either represented negative qualities, such as cluttered, or seemed unlikely to have pertained to the source being evaluated, such as interactive. In some cases respondents favored one of a pair of mutually exclusive criteria, as did the 5 who apparently understood and selected the research approach theoretical versus the 10 who selected applied. Two additional criteria generated by respondents concerned the inclusion in documents of interviews and of case studies.
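The percentages quoted here follow from the Table 2 counts and the 28 respondents. As a small worked check, the sketch below uses only a subset of the Table 2 counts; the variable and function names are illustrative.

```python
N_RESPONDENTS = 28  # n reported for Table 2

# Subset of selection counts copied from Table 2.
selection_counts = {
    "About my topic": 28,
    "Interesting": 14,
    "Applied": 10,
    "Theoretical": 5,
    "Interactive": 0,
}


def percent_selected(count, n=N_RESPONDENTS):
    """Share of respondents selecting a criterion, as a whole percent."""
    return round(100 * count / n)


percentages = {term: percent_selected(c) for term, c in selection_counts.items()}

# Criteria in this subset chosen by at least half of the respondents.
majority_terms = [t for t, c in selection_counts.items() if c >= N_RESPONDENTS / 2]
```

The same calculation applied to the full table identifies the dozen criteria selected by 14 or more respondents.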
One area that remained problematic in Test 2 was negative applications of criteria. Because many Test 1 respondents did not place positive and negative forms of the same concept (e.g., familiar and unfamiliar) in the same concept piles, we deleted most negative forms from the Test 2 term set. We then instructed Test 2 respondents to write "not" on a card (e.g., not familiar) if they had applied the criterion in a negative sense. Few chose to do so, possibly because the instruction was still confusing or because they simply had no negative evaluations of "best" documents. One respondent did add "not" to confusing and to theoretical.
Criterion rankings generally followed selection frequencies (Table 2) and criterion concept groups (Table 3). Concepts of aboutness were consistently ranked highest, with 23 respondents (82%) selecting about my topic or relevant or pertinent as a primary reason for selecting the source. This is not surprising in view of the fact that respondents were asked to evaluate the potential best source for their topical papers; it is also consistent with the findings of past research. The concept of currency, represented by the terms current, recent, and up-to-date, was also important: 8 respondents ranked this concept in their first or second group and 8 respondents ranked it in their third group. This was consistent with the class paper assignment, which required that sources be published within the last five years. Other criteria that appeared in highly ranked groups were available and/or easy to get and/or accessible; readable and/or well-written and/or understandable; reliable and/or credible; and expert and/or well-known.
Perhaps the greatest challenge in any relevance research, from the researcher's point of view, is disentangling semantic and conceptual problems from methodological problems (see Schamber, 1994). Overall, the results of our exploratory user tests of relevance criterion concepts were successful in helping us perceive various matters through users' eyes, notably the challenge of interpreting the meanings of criterion terms and that of performing certain tasks. Further, our results, like those of previous studies, suggest that the same factors that affect relevance judgments per se also affect performance of relevance judgment tasks. This is most evident in the importance of situational context and user knowledge in relevance evaluation.
The difference between sorting relevance criterion concepts outside the context of a personal information problem situation in User Test 1 and dealing with them within such a context in User Test 2 clearly affected respondents' task performance. Test 1 respondents tended to "make sense" of the criteria by creating their own contexts for sorting the terms. Often these contexts were suggested neither by our task instructions nor by our description of the previous studies that served as sources of the criterion terms. On the other hand, most Test 2 respondents performed the sorting task with relative ease and generally appeared to have more confidence in their decisions. Interestingly, several Test 1 respondents seemed to be bothered by what they perceived as highly subjective criteria such as he/she is prominent and I like it. One stated, "The criteria make a lot of assumptions and value judgments of an author's work." Another said, "It validates my viewpoint--I don't like this because it is possible my viewpoint isn't right." A few dismissed these more affective criteria, saying they would not use them. One said, "My favorite doesn't have a place in my search" and "I like it makes no difference [to me]." Test 2 respondents, however, applied such concepts frequently and often ranked them as important in selecting their information source.
A less conspicuous observation concerned the importance of user knowledge and experience. This was most evident in Test 2 evaluations based on the criteria. Unlike many respondents in the original two studies conducted in academic libraries, respondents in this study were not expert scholars or researchers. Thus their recognition of terms for research approaches apparently varied, from those who grouped interpretive and descriptive as synonyms to those who selected either theoretical or applied.
Finally, with respect to methodology, although the first two tests suffered from ambiguities in the criterion term set and awkwardness in the task instructions, they did serve their exploratory purpose in allowing us to identify and begin to address specific problems in developing the user criterion scale. Subsequent tests will further refine the criterion term set and explore various techniques for collecting evaluation data based on the criteria. As of this writing, Test 3 has been administered and data analysis begun. The stimulus set consists of 55 criterion terms. Respondents were 46 graduate students in information/library science who were asked, after completing their final class papers, to choose the best information source for their papers and apply the criteria in evaluating it. Preliminary results show that each criterion concept was selected at least once and several more were suggested. In Test 4, soon to be administered to students in a field outside information/library science, the stimulus set will consist of 40 criterion terms, evaluations will involve magnitude estimation, and the results will be statistically analyzed. Further tests will be administered to users in various non-academic settings, including everyday information problem situations.
Barry, C. & Schamber, L. (1995). User-defined relevance criteria: A comparison of two studies. Proceedings of the 58th Annual Meeting of the American Society for Information Science, 32, 103-111.
Eisenberg, M. (1988). Measuring Relevance Judgments. Information Processing & Management, 24(4), 373-389.
Janes, J. W. (1991). Relevance Judgments and the Incremental Presentation of Document Representations. Information Processing & Management, 27(6), 629-646.
Rorvig, M. E. (1988). Psychometric Measurement and Information Retrieval. In: Williams, Martha E., ed. Annual Review of Information Science and Technology, Vol. 23. Amsterdam, The Netherlands: Elsevier Science Publishers for the American Society for Information Science, 157-189.
Schamber, L. (1991). Users' Criteria for Evaluation in a Multimedia Environment. In: Griffiths, José-Marie, ed. ASIS '91: Proceedings of the American Society for Information Science (ASIS) 54th Annual Meeting, 28, 126-133.
Schamber, L. (1994). Relevance and information behavior. In Martha E. Williams (Ed.), Annual Review of Information Science and Technology, 29, 3-48.
Su, L. T. (1993). Is Relevance an Adequate Criterion for Retrieval System Evaluation: An Empirical Inquiry into the User's Evaluation. ASIS '93: Proceedings of the American Society for Information Science (ASIS) 56th Annual Meeting, 30, 93-103.
Wang, P. (1994). A Cognitive Model of Document Selection of Real Users of IR Systems. Unpublished doctoral dissertation, University of Maryland, 1994.
| About my topic | He/she has style [1] | Portable [1] |
| Accessible | I hate it [1] | Precise |
| Accurate | I have heard of it [1] | Prestigious |
| Ambiguous | I can edit it [1] | Prominent |
| Applied | I have immediate access [1] | Proven |
| Appropriate | I have control [1] | Quality work [1] |
| Available | I have a background in it | Readable |
| Big/small picture [1] | I know it firsthand [1] | Recent |
| Boring [1] | I am aware of this [1] | Relevant |
| Broad | I have input [1] | Reputable |
| Clear | I agree with it | Saved time |
| Cluttered | I already have it | Saved effort |
| Color [1] | I have power [1] | Specific |
| Complete | I can zoom in/out [1] | Speculative [1] |
| Comprehensive | I know the author | Summary |
| Concise | I know the publication | Theoretical |
| Confusing | I know the source | Thorough |
| Consistent | I can track it [1] | Too long |
| Controversial | I like it | Trivial |
| Convenient | I thought of it [1] | Trustworthy [1] |
| Credible | I can request it [1] | Understandable |
| Current | Important | Unfamiliar [2] |
| Cursory | In-depth | Unique |
| Descriptive | Incorrect [1] | Unreliable [2] |
| Detailed | Inexact [1] | Usable |
| Difficult to get [1] | Interactive | User-friendly |
| Easy to get | Interesting | Vague |
| Enjoyable | Interpretive | Validates my viewpoint |
| Expensive | Introductory | Well-known |
| Explanation [1] | It has methodology | |
| First-rate [1] | It is my favorite [1] | |
| Focused | Lightweight [1] | |
| Free | Local [1] | |
| Geographic area [1] | Misleading | |
| Has procedures [1] | Narrow | |
| Has examples | New to me | |
| Has illustrations | Novel [1] | |
| Has techniques | Only source | |
| Has bibliography | Original | |
| Has time periods [1] | Outdated [1] | |
| Has tables | Overview/closeup | |
| Has graphics [1] | Overview | |
| Has references | Pertinent | |
| He/she is expert | Poorly written [1] | |
| He/she has personality [1] | Popular | |
| Total terms=119 | | |
1. Deleted for User Test 2
2. Changed to positive terms (familiar, reliable) for User Test 2
| Term | Frequency | Term | Frequency |
|---|---|---|---|
| About my topic | 28 | General or specific | 5 |
| Appropriate | 24 | I already have this | 5 |
| Current | 24 | Introductory | 5 |
| Relevant | 24 | Precise | 5 |
| Pertinent | 21 | Theoretical | 5 |
| Recent | 21 | Accurate | 4 |
| Usable | 21 | Complete | 4 |
| Descriptive | 20 | Prominent | 4 |
| Available | 16 | Thorough | 4 |
| Readable | 16 | Well-known | 4 |
| Up-to-date | 16 | Consistent | 3 |
| Interesting | 14 | Controversial | 3 |
| Understandable | 13 | Enjoyable | 3 |
| Accessible | 12 | Familiar | 3 |
| Credible | 12 | Free | 3 |
| Easy to get | 11 | Has tables | 3 |
| Focused | 11 | I don't have a background in this | 3 |
| Overview | 11 | Saved effort | 3 |
| Applied | 10 | Summary | 3 |
| Comprehensive | 10 | Too long or short [1] | 3 |
| Detailed | 10 | Confusing [1] | 2 |
| I like it | 10 | I agree with it | 2 |
| In-depth | 10 | Unique | 2 |
| Provides proof | 10 | Validates my viewpoint | 2 |
| Important | 9 | Cursory | 1 |
| Provides examples | 9 | I know the author | 1 |
| User-friendly | 9 | It is the only source | 1 |
| Clear | 8 | Popular | 1 |
| Has bibliography | 8 | Prestigious | 1 |
| I know the publication | 8 | Ambiguous | 0 |
| New to me | 8 | Cluttered | 0 |
| Convenient | 7 | Expensive | 0 |
| Describes methodology | 7 | Has illustrations | 0 |
| Expert | 7 | Interactive | 0 |
| Interpretive | 7 | Misleading | 0 |
| Reliable | 7 | Original | 0 |
| Reputable | 7 | Saved money | 0 |
| Saved time | 7 | Trivial | 0 |
| Concise | 6 | Vague | 0 |
| Describes techniques | 6 | | |
| I know the source | 6 | | |
| Provides history | 6 | | |
| Well-written | 6 | | |
| Broad or narrow | 5 | | |

Total terms=83. n=28 respondents. Data represent the number of respondents who selected each criterion for making the evaluation.
1. "Not" added to criterion term by respondent
| Group Concept | Criteria in Group | All Terms in Group | Some Terms in Group |
|---|---|---|---|
| Aboutness | About my topic, Appropriate, Pertinent, Relevant, Usable | 3 | 20 |
| Currency | Current, Recent, Up-to-date | 16 | 18 |
| Availability | Available, Accessible, Convenient, Easy to get | 4 | 5 |
| Clarity | Clear, Readable, Understandable | 3 | 7 |
| Credibility | Credible, Expert, I know the publication, I know the source, Prominent, Reliable, Reputable, Well-written | 2 | 5 |

Total terms=83