Next: DISCUSSION Up: An Interactive and Collaborative Previous: The client's feedback during

EVALUATION

Recall and precision are two evaluation measures used for many IR systems (Salton & McGill, 1983). Recall is the percentage of all relevant documents that are retrieved in response to a client's question; precision is the percentage of the retrieved documents that are relevant. Both measures capture the information seeking behavior of a client who is interested in retrieving all relevant answers.
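As an illustration of the two definitions above, the following minimal sketch computes recall and precision for a single question. The function name and the document identifiers are hypothetical and are not part of the system described here; the values are returned as fractions rather than percentages.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) as fractions in [0, 1].

    retrieved: ids of the documents returned for one question
    relevant:  ids of all documents that are relevant to that question
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical ids, for illustration only.
recall, precision = recall_precision(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```

Here two of the three relevant documents are retrieved, and two of the four retrieved documents are relevant, so recall is 2/3 and precision is 1/2.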

We observed that the information seeking behavior of CIE clients is different. Typically, they want to find the first relevant answer quickly. One way to measure this is to count the number of interactions the client has with the system after the question is submitted and before the first relevant answer is found. Averaging this number over all submitted questions gives a measure that we call the average number of interactions (ANI).
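The ANI is a plain arithmetic mean of per-question interaction counts. A minimal sketch, assuming the counts have already been collected (the function name and the sample counts are hypothetical):

```python
def average_number_of_interactions(interaction_counts):
    """Mean number of interactions before the first relevant answer.

    interaction_counts: one integer per submitted question.
    """
    if not interaction_counts:
        raise ValueError("at least one question is required")
    return sum(interaction_counts) / len(interaction_counts)


# Hypothetical counts for five questions, for illustration only.
ani = average_number_of_interactions([3, 1, 4, 2, 5])
```

For these five counts the sum is 15, so the ANI is 3.0.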

To estimate the system's ANI, 10 subjects were asked to write 20 questions each about Common Lisp and artificial intelligence (AI). The resulting 200 questions were manually screened to keep only those with at least one answer in the available Q&A collections, which yielded a test set of 105 questions. Five different subjects were each given a random sample of 10 questions drawn without replacement from the test set. For each question, the subjects reported the number of interactions before the first relevant answer was found. The following interactions were counted: submitting a question, selecting a topic when multiple topics are retrieved, requesting a topic's description when a topic is unfamiliar, selecting an expert when multiple experts are retrieved, requesting an expert's expertise on a topic, and requesting another Q&A.
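The sampling step can be sketched as follows. This is only one reading of the protocol: it assumes that the samples are disjoint across subjects as well as free of repeats within a sample, which the text does not state explicitly. The function name, the seed, and the placeholder question labels are hypothetical.

```python
import random


def draw_samples(questions, n_subjects=5, sample_size=10, seed=0):
    """Give each subject a random sample drawn without replacement.

    Assumption: no question is shared between subjects, so the samples
    are consecutive slices of one shuffled copy of the test set.
    """
    if n_subjects * sample_size > len(questions):
        raise ValueError("not enough questions for disjoint samples")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    pool = list(questions)
    rng.shuffle(pool)
    return [
        pool[i * sample_size:(i + 1) * sample_size]
        for i in range(n_subjects)
    ]


# Placeholder labels standing in for the 105 screened test questions.
test_questions = [f"question-{i}" for i in range(105)]
samples = draw_samples(test_questions)
```

Each entry of `samples` holds the 10 questions for one subject; averaging that subject's reported interaction counts gives the per-subject ANI.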

 

Figure 7: Average Number of Navigation Interactions

Figure 8: Average Number of Q&A Interactions

The results are summarized in Figure 7. We attribute the differences in ANI to differences in term ambiguity across the samples. Most terms in the questions given to the fourth subject identified the correct topic uniquely. For example, in the question ``How does garbage collection work in Lisp?'', the terms ``garbage10'' and ``lisp10'' lead to the retrieval of the vectors on the path to the topic of Common Lisp. Most terms in the questions given to the second subject, by contrast, were indicative of multiple topics. For example, the terms of the question ``Do Lisp process schedulers use the round robin algorithm?'' retrieved the topics of Common Lisp, Operating Systems, and Theory. Consequently, the second subject had to clarify her search preferences to the system on several occasions.

We also experimented with the relative importance of tex2html_wrap_inline644 and tex2html_wrap_inline646 in Equation 3. We chose a collection of 140 Q&A's about Common Lisp and AI written by a registered expert. To simulate a dynamically growing collection, we split it into Q&A subsets whose cardinalities were multiples of 20: the first 20 Q&A's, the first 40 Q&A's, etc. For each subset, we selected those of the 105 test questions that are answered in it. For each question, an ordered list of matches was computed, and the number of nonrelevant Q&A's before the first relevant one was counted. This number was averaged over the questions for every Q&A subset. The activation depth was 2; tex2html_wrap_inline624 in Equation 3 was 1.0; tex2html_wrap_inline626 and tex2html_wrap_inline628 in the same equation were set to 0.7 and 0.3, respectively, in the first experiment, and to 0.3 and 0.7, respectively, in the second one. The ANI's are given in Figure 8, which shows that on collections of 20 and 40 Q&A's the metric which valued tex2html_wrap_inline644 higher than tex2html_wrap_inline646 achieved smaller ANI's, while the other metric was more successful on larger collections.
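The procedure of this experiment can be sketched as below. The sketch covers only the bookkeeping: building the growing subsets, skipping questions that a subset cannot answer, and counting the nonrelevant Q&A's ranked before the first relevant one. The ranking itself is passed in as a hypothetical `rank` callable and does not implement Equation 3; all names and the toy data are assumptions made for illustration.

```python
def growing_subsets(collection, step=20):
    """Prefixes of the collection: the first 20 items, the first 40, etc."""
    return [collection[:n] for n in range(step, len(collection) + 1, step)]


def nonrelevant_before_first_relevant(ranked_ids, relevant_ids):
    """Number of nonrelevant items ranked above the first relevant one.

    Returns None if no relevant item appears in the ranking.
    """
    for position, qa_id in enumerate(ranked_ids):
        if qa_id in relevant_ids:
            return position
    return None


def average_cost(subset, questions, rank):
    """Average the count over the questions answered in this subset.

    questions: list of (question, set of relevant Q&A ids) pairs
    rank:      hypothetical callable (question, subset) -> ordered ids
    """
    subset_ids = set(subset)
    costs = []
    for question, relevant_ids in questions:
        answerable = relevant_ids & subset_ids
        if not answerable:
            continue  # the question has no answer in this subset
        cost = nonrelevant_before_first_relevant(rank(question, subset), answerable)
        if cost is not None:
            costs.append(cost)
    return sum(costs) / len(costs) if costs else None


# Toy data: Q&A ids 0..139 and a stand-in ranker that keeps the stored order.
collection = list(range(140))
questions = [("q1", {5}), ("q2", {30})]
keep_order = lambda question, subset: list(subset)
averages = [average_cost(s, questions, keep_order) for s in growing_subsets(collection)]
```

Running the same loop once per weight setting, with a ranker that actually applies the metric, would produce one row of averages per experiment, which is the kind of comparison reported in Figure 8.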



Val Kulyukin
Thu Mar 19 09:57:35 CST 1998