Recall and precision are two evaluation measures used in many IR systems (Salton & McGill, 1983). Recall is the percentage of all relevant documents that are retrieved in response to a client's question; precision is the percentage of the retrieved documents that are relevant. The information seeking behavior that these measures capture is that of a client who is interested in retrieving all relevant answers.
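As a minimal sketch of these two definitions, assume that the retrieved and the relevant documents are available as sets of identifiers. The function name and the sample identifiers below are illustrative assumptions, not part of any particular IR system; the values are returned as fractions rather than percentages.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) as fractions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
recall, precision = recall_precision(
    retrieved={"d1", "d2", "d3", "d7"},
    relevant={"d1", "d2", "d3", "d4", "d5", "d6"},
)
print(recall, precision)  # 0.5 0.75
```

A client who wants every relevant answer cares about both numbers; the next paragraph describes a different goal.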
We observed that the information seeking behavior of CIE clients is different. Typically, they want to find the first relevant answer fast. One way to measure this is to count the number of interactions the client has with the system after the question is submitted and before the first relevant answer is found. This number is averaged over all submitted questions. Hence, the measure is called the average number of interactions (ANI).
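The averaging step can be sketched as follows, assuming that one interaction count has already been recorded per question. The function name and the counts are hypothetical and are not data from the study.

```python
def average_number_of_interactions(interaction_counts):
    """Mean number of interactions before the first relevant answer."""
    if not interaction_counts:
        raise ValueError("need at least one question")
    return sum(interaction_counts) / len(interaction_counts)


# Hypothetical per-question interaction counts for five questions.
counts = [1, 3, 2, 4, 2]
ani = average_number_of_interactions(counts)
print(ani)  # 2.4
```

A lower ANI means that, on average, clients reached their first relevant answer in fewer steps.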
To estimate the system's ANI, 10 subjects were asked to write 20 questions each about Common Lisp and artificial intelligence (AI). The resulting 200 questions were manually screened to keep only those with at least one answer in the available Q&A collections, which yielded a test set of 105 questions. Five different subjects were each given a random sample of size 10 taken without replacement from the test questions. For each question, the subjects reported the number of interactions before the first relevant answer was found. The following interactions were counted: submitting a question, selecting a topic when multiple topics are retrieved, requesting a topic's description when a topic is unfamiliar, selecting an expert when multiple experts are retrieved, requesting an expert's expertise on a topic, and requesting another Q&A.
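The sampling step might be sketched as below. This assumes that each subject's 10 questions are drawn without replacement within that subject's sample and independently across subjects; the text does not say whether the five samples were also disjoint. The function name, the integer question ids, and the seed are illustrative assumptions.

```python
import random


def draw_samples(question_ids, n_subjects=5, sample_size=10, seed=0):
    """Draw one sample per subject; no question repeats within a sample."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [rng.sample(question_ids, sample_size) for _ in range(n_subjects)]


# Placeholder ids standing in for the 105 screened test questions.
test_questions = list(range(1, 106))
samples = draw_samples(test_questions)

for subject, sample in enumerate(samples, start=1):
    print(subject, sorted(sample))
```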
Figure 7: Average Number of Navigation Interactions
Figure 8: Average Number of Q&A Interactions
The results are summarized in Figure 7. We attribute the differences in ANI to differences in the term ambiguity of the samples. Most terms in the questions given to the fourth subject identified the correct topic uniquely. For example, in the question "How does garbage collection work in Lisp?", the terms "garbage10" and "lisp10" lead to the retrieval of the vectors on the path to the topic of Common Lisp. Most terms in the questions given to the second subject were indicative of multiple topics. For example, the terms of the question "Do Lisp process schedulers use the round robin algorithm?" retrieved the topics of Common Lisp, Operating Systems, and Theory. Consequently, the second subject had to clarify her search preferences to the system on several occasions.
We also experimented with the relative importance of two parameters in Equation 3. We chose a Q&A collection of 140 Q&A's about Common Lisp and AI written by a registered expert. To simulate a dynamically growing collection, we split it into Q&A subsets whose cardinalities were multiples of 20: the first 20 Q&A's, the first 40 Q&A's, etc. For each subset, we selected those of the 105 test questions that were answered in it. For each question, an ordered list of matches was computed, and the number of nonrelevant Q&A's before the first relevant one was counted. This number was averaged for every Q&A set. The activation depth was 2; another parameter in Equation 3 was 1.0; and the two parameters under study were set to .7 and .3, respectively, in the first experiment, and to .3 and .7, respectively, in the second one. The ANI's are given in Figure 8. The table shows that on collections of 20 and 40 Q&A's the metric that valued one of the two parameters higher than the other achieved smaller ANI's, while the other metric was more successful on larger collections.
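The counting and averaging used in this experiment can be sketched as follows. The sketch assumes that each question comes with a ranked list of Q&A identifiers and a set of identifiers judged relevant. It does not reproduce Equation 3 or the ranking itself; all names and sample data are hypothetical.

```python
def nonrelevant_before_first_relevant(ranked_ids, relevant_ids):
    """Count nonrelevant items ranked above the first relevant one.

    Returns None when the ranked list contains no relevant item.
    """
    for position, item in enumerate(ranked_ids):
        if item in relevant_ids:
            return position  # items at indices 0..position-1 are nonrelevant
    return None


def mean_nonrelevant_count(runs):
    """Average the count over all (ranked_ids, relevant_ids) pairs."""
    counts = [nonrelevant_before_first_relevant(r, rel) for r, rel in runs]
    # Assumption: questions with no relevant match are skipped.
    counts = [c for c in counts if c is not None]
    if not counts:
        raise ValueError("no question had a relevant match")
    return sum(counts) / len(counts)


# Hypothetical rankings for three questions against one Q&A subset.
runs = [
    (["qa3", "qa1", "qa9"], {"qa1"}),                # 1 nonrelevant first
    (["qa2", "qa5"], {"qa2"}),                       # 0 nonrelevant first
    (["qa8", "qa7", "qa4", "qa6"], {"qa4", "qa6"}),  # 2 nonrelevant first
]
print(mean_nonrelevant_count(runs))  # 1.0
```

Running this once per subset (20 Q&A's, 40 Q&A's, and so on) and once per parameter setting would give one average per cell of a table like the one in Figure 8.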