are different question categories that have different symbolic procedures. In combining question functions and statement elements, they identified a total of 21 categories with different symbolic procedures operating on the knowledge structures.
While Lehnert's work was primarily concerned with developing a question answering system, the concept of using natural language input has also been applied in the retrieval setting for searching document databases. Vickery and Vickery (1992) developed a system that interprets natural language inputs as Boolean queries for searching an online database. This system takes the relationships present in the language of the query and transforms them into Boolean relationships.
Other natural-language approaches to information retrieval draw on frame semantics as developed by Fillmore (1968). In applications of frame semantics, each document is represented by a case frame containing a number of slots. When the slots are filled, they map the relationships between concepts in the frame. Liddy (1993) developed a system incorporating this alternative representation for documents and queries. Using Subject Field Codes (SFCs) from the Longman Dictionary of Contemporary English, Liddy's system generates summary-level semantic representations of each document. Queries are also represented as SFC vectors. This allows access to documents at the conceptual level. One advantage of a frame-based approach is that it retains the semantic relationships between terms, information that is discarded in most systems.
The above models all attempt, in one way or another, to capture and categorize the subtler nuances of human communication. If the categories are too broad, then much of the communication will not make sense. Elaboration and subtlety are prerequisites for any system that would attempt to model more fully the human question answering process. In information retrieval, however, the goal is more pragmatic: to improve system performance and to facilitate end-user acquisition of relevant information. Since what is sought is improvement rather than complete viability, one may be able to make progress with a lesser degree of sophistication.
Because the user is usually concerned with elaborating a partially known information need, the types of questions asked in information retrieval settings can be viewed as a subset of the types of questions humans ask. In a sense, the user is aware of what he does not know in this setting. Consequently, his questions will reflect this and will be structured in a way that specifies his information need.
To investigate the relation between the retrieval mechanism and the level of question elaboration, this study made use of the cystic fibrosis (CF) database, a small (1,239 documents) test collection of documents dealing with clinical and basic research issues associated with CF (see Shaw et al., 1990). The database has 100 queries associated with it and corresponding evaluations of the relevance of each document for each query by three subject experts. Because queries were generated by a researcher/clinician active in the area of CF, questions can be seen as representative of the types of information needs that arise in the conduct of CF research and in the care of CF patients. Relevance judgments for each query consisted of a ternary decision for each document: highly relevant (weight = 2), somewhat relevant (weight = 1), not relevant (weight = 0). In the current study, two levels of relevance were examined: summed weights of 5 or greater (two subject experts rated the document as highly relevant and the third rated the document as somewhat or highly relevant) and summed weights of 1 or greater (at least one researcher rated the document as somewhat relevant). These weights can be thought of as corresponding respectively to searches requiring high precision and searches requiring high recall.
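The two relevance levels can be illustrated with a minimal sketch. It assumes that each document's weight for a query is the sum of the three experts' ternary scores, which is what the thresholds of 5 and 1 imply; the function name `relevant_sets` and the sample document identifiers are hypothetical.

```python
def relevant_sets(judgments_by_doc):
    """Split documents into the two relevance levels examined in the study.

    judgments_by_doc maps a document id to the three experts' scores,
    each 0 (not relevant), 1 (somewhat relevant), or 2 (highly relevant).
    ASSUMPTION: a document's weight is the sum of the three scores.
    """
    high_precision = set()  # summed weight of 5 or greater
    high_recall = set()     # summed weight of 1 or greater
    for doc_id, scores in judgments_by_doc.items():
        weight = sum(scores)
        if weight >= 5:
            high_precision.add(doc_id)
        if weight >= 1:
            high_recall.add(doc_id)
    return high_precision, high_recall


# Hypothetical judgments for four documents on one query.
sample = {"a": (2, 2, 1), "b": (2, 1, 1), "c": (0, 0, 1), "d": (0, 0, 0)}
strict, broad = relevant_sets(sample)
```

Under this reading, every document in the high-precision set also belongs to the high-recall set.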
Conceptual Categories
One way of interpreting a question is to identify the members of its two basic component sets: known elements and information-need markers. Typically, the known elements are the terms of greatest discriminability. These elements are equivalent to the search terms one would first use when translating a question into a Boolean search. The information-need markers provide information about the relationship between the inquirer and his current knowledge of the subject under contemplation. For example, one question compiled for the CF database asks:
What is the incidence of male fertility in CF?
In this instance, the known elements are male fertility in CF and the information-need markers are what incidence. The questioner might translate this into the Boolean search, "male AND fertility AND CF." Note, however, that the person who asks the question "Does CF have an effect on male fertility?" would, in all likelihood, perform the same Boolean search though one might expect the set of relevant documents for each question to differ slightly. In essence, the level of knowledge of the questioner toward his area of investigation is being ignored by this Boolean approach. The conceptual categories of questions reported in this investigation were developed to assess whether this typically overlooked relationship between the questioner and his level of knowledge has an effect on retrieval outcomes.
Undoubtedly, the questions in the CF database do not represent all conceivable categories of questions used in a retrieval situation. Though the model developed was intended to be generalizable beyond the current investigation, it is not to be considered exhaustive. After careful examination of all 100 questions from the CF database, the questions were each grouped manually into one of five conceptual categories:
(1) Verification
(2) Causation
(3) Concept completion
(4) Association
(5) Disassociation
Some of the categories were taken from those derived by Lehnert (see above), though they are not necessarily equivalent. Because Lehnert was working on the process of story understanding, the emphasis of her conceptual categories is the determination of causation. Her model was felt to be too broad for the retrieval setting. Rather than emphasizing causation, the five categories delineated for this investigation emphasize relationships among question concepts.
Verification questions, also called "yes-no" questions, ask about the truth of an event and are usually satisfied with a yes or no answer. These questions are generally though not exclusively characterized by fewer syntactic relationships and a decreased use of prepositions. This category accounts for 17 of the questions compiled for the CF database. The most common feature of verification questions is that they do not begin with pronominal slots (e.g., "what," "how") to mark unknown or missing information. Structurally, verification questions are simple statement inversions. For example:
Is CF mucus abnormal?
This category of question does not imply that the set of possible answers will not be characterized by gradation, merely that the questioner does not himself know the truth of the statement "CF mucus is abnormal." If he believed this to be the case, he would have rephrased the question as "How is CF mucus abnormal?" or requested information on particular abnormalities.
While the above example is structurally the same question as "Is CF mucus normal?" the truth value of the implied assertion is exactly the opposite. Depending on the inquirer's point of view, therefore, the specific inquiry contained in the verification question may have only a weak relationship to the subject of the query. Because the question's assertion has an unknown truth value and because it is precisely this truth value that is being sought, a system automatically assigning weight values to the terms in the query could make adjustments to allow for the possibility of non-truthful terms. An alternate approach would be for the system to search for both synonyms and antonyms.
Causation questions inquire into the causal relationship between events (or states). This category accounts for 16 of the questions compiled for the CF database. In this category of questions, when one concept is introduced to another there is an unknown effect. The inquiry is into the characteristics or properties of the effect. Common causation indicators include words like "effects," "effective," and "role," though other words and markers may distinguish the category. In this category, a pronominal slot marker (e.g., "what") is generally paired with a causation indicator (e.g., "what effects") to indicate which specific aspects of the stated question elements are unknown. An example of a causation question is:
What are the effects of calcium on the physical properties of mucus from CF patients?
In this example, the stated question elements calcium and mucus from CF patients are known but the effect-relation between them has not been determined by the questioner. In a verification question (e.g., "Does calcium have an effect on the physical properties of mucus?"), a causal role has not yet been established. In causation questions, however, the effects are presumed to exist. In an information-seeking situation, the inquirer who is unable to attain a response to a causation question may modify the question into the less explicit verification question.
In some respects, concept completion questions represent the most prototypical cognitive model used in a retrieval situation. In this type of question, there is an inquiry about known question elements, but no causal or determinate relationship can be inferred. Often the question will ask explicitly or implicitly for a description or characteristic of the known question elements. According to Lehnert (1978),
Concept completion questions include many who-, what-, where-, and when-questions. These questions are very much like fill-in-the blank questions, insofar as they specify a particular event with one missing component and ask for completion of that event. (p. 69)
Concept completion questions represent a majority (54) of the questions compiled for the CF database. Words associated with this category have to do with non-specific aspects of the known question elements: "abnormalities," "characteristics," "complications," "composition," "conditions," "evidence," "factors," "features," "frequency," "incidence," "manifestations," "prognosis," "properties." These high-frequency words are generally not very useful as search terms, but they serve to mark the relationship between the known question elements and the information need. A pronominal slot-marker will often be paired with one of these "empty" concepts (e.g., "what characteristics") to indicate that the unknown information is believed to exist, but the nature of this information and its relationship to the known question elements has not been determined. A concept completion question would be as follows:
What are the pathologic features of lung disease in CF patients?
In this example, the known question elements are lung disease in CF patients. The information requested concerns what pathologic features, which are presumed to exist but are not known.
Association and disassociation questions are similar to concept completion questions except that the inquiry pertains to multiple elements and the unknown information need is more narrowly focused. Association questions inquire into the associative relationship among known question elements. This category represents seven of the questions compiled for the CF database. In this category, there are always at least two known elements, and the information need pertains to the associative link between these two or more elements. Words marking the association category include "association," "concordance," and "relationship." These are usually combined with the pronominal slot-marker (e.g., "what association"). For example:
What is the association between liver disease (cirrhosis) and vitamin A metabolism in CF?
In the above example, the known question elements are liver disease and vitamin A metabolism in CF. The inquiry about these elements, what association, attempts to form an explicit link between the two. In this example, the questioner has presupposed that a link between the two known elements has already been established, and he wishes to use the retrieval system to find the specific nature of that link.
Disassociation questions inquire into the differences between known question elements. This category has the fewest questions compiled for the CF database (6). Oftentimes the only difference between these stated elements will be the modifier used to describe each of them. As with association questions, there are always at least two known elements, but the information need pertains to the differences between these two or more elements. Words marking the disassociation category include "different," "differently," and "differences." As with association questions, these are usually combined with the pronominal slot-marker (e.g., "what differences"). Oddly enough, while the category markers represent information needs, it seems that the modifiers of the category markers may themselves sometimes be known question elements. For example:
What structural or enzymatic differences are there between fibroblasts from CF patients and non-CF patients?
In this example, fibroblasts from CF patients and fibroblasts from non-CF patients are the known question elements. The information need is indicated by what differences. Yet structure and enzyme also appear to represent information that the questioner has already processed. In disassociation questions, because the information need is often elaborated in greater detail than is the information need in association questions, it may be judicious to consider category marker modifiers as known question elements rather than simply as part of the category marker.
There are certainly other categories that would apply to questions asked of information retrieval systems that were not represented in the sample set. Lehnert's model suggests a number of possible categories such as causal antecedent questions ("Why did X occur?") and quantification questions ("How many Xs are in Y?").
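The questions in this study were categorized manually. Purely as an illustration, the marker words listed for each category above could drive a rule-based first pass such as the sketch below. The function name `classify`, the list of pronominal slot words, and the order in which markers are checked are assumptions made for this example and were not part of the study.

```python
import re

# ASSUMPTION: a small set of pronominal slot words; the paper names
# only "what" and "how" as examples.
WH_WORDS = {"what", "how", "why", "who", "where", "when", "which"}

# Marker words quoted in the text. The checking order is an assumption:
# narrower categories are tested before broader ones.
MARKERS = [
    ("disassociation", {"different", "differently", "differences"}),
    ("association", {"association", "concordance", "relationship"}),
    ("causation", {"effects", "effective", "role"}),
]


def classify(question):
    """Assign a question to one of the five conceptual categories."""
    tokens = re.findall(r"[a-z]+", question.lower())
    # Verification questions do not open with a pronominal slot.
    if not tokens or tokens[0] not in WH_WORDS:
        return "verification"
    token_set = set(tokens)
    for category, markers in MARKERS:
        if markers & token_set:
            return category
    # Remaining wh-questions default to concept completion.
    return "concept completion"


print(classify("Is CF mucus abnormal?"))
print(classify("What are the pathologic features of lung disease in CF patients?"))
```

A rule set this small will misassign questions whose markers fall outside the quoted lists, so it is best read as a starting point for the kind of front-end parser discussed later in the paper.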
Method and Results
Two retrieval methods were chosen to investigate potential differences in outcomes across conceptual categories: the cosine measurement and the similarity measurement. Both produce ranked output which allows for a detailed comparison of results, but each reflects a different approach to document retrieval: the latter employs a term weighting scheme while the former does not.
Questions were assigned to the five categories described in the previous section. This investigation permitted the designation of only one category per question. In actuality, there is a certain amount of overlap between categories, and this restriction may not always be appropriate. Title and Abstract information was extracted from each document in the CF database, 250 common stop-words were removed, and terms were truncated using the Porter (1980) word stemming algorithm. The two different retrieval algorithms were then run on the database.
With the cosine measurement, document vectors were created for each of the 1,239 documents in the database. The cosine correlation, given below, measures the cosine of the angle between queries and documents when each is viewed as a vector in the multidimensional term space of dimension t, where t represents the number of possible term associations (see Salton & McGill, 1983). Document vectors were assigned binary weights for each term in the database (1 for present, 0 for absent). Similar vectors were created for each of the 100 associated questions. The cosine measure was then used to assess query-document similarity and to produce a ranked output of documents for each query.
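For binary vectors, the cosine of the angle between a document and a query reduces to the number of terms they share divided by the square root of the product of their term counts. The sketch below illustrates that calculation and the resulting ranking. It assumes documents and queries have already been reduced to sets of stemmed terms; the names `cosine_binary` and `rank_by_cosine` and the toy collection are hypothetical.

```python
import math


def cosine_binary(doc_terms, query_terms):
    """Cosine similarity of two binary term vectors, given as term sets."""
    doc, query = set(doc_terms), set(query_terms)
    if not doc or not query:
        return 0.0  # an empty vector has no defined angle; treat as no match
    shared = len(doc & query)
    return shared / math.sqrt(len(doc) * len(query))


def rank_by_cosine(docs, query_terms):
    """Return document ids ordered from most to least similar to the query."""
    scores = {doc_id: cosine_binary(terms, query_terms)
              for doc_id, terms in docs.items()}
    # Ties are broken by document id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy collection of already-stemmed terms (illustrative only).
docs = {
    "d1": {"male", "fertil", "cf"},
    "d2": {"cf", "mucu"},
    "d3": {"calcium"},
}
query = {"male", "fertil", "cf"}
ranking = rank_by_cosine(docs, query)
```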
In the second algorithm, the similarity of documents to each query was calculated according to the formula below. In this measure, the similarity of document i and query j is equal to the sum of the weights of their intersecting terms (Salton & McGill, 1983). The inverse document frequency (IDF) weight w for each term k is calculated by taking the log of the number of documents divided by the number of documents with term k (Robertson, 1974).
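$$\mathrm{sim}(D_i, Q_j) = \sum_{k \,\in\, D_i \cap Q_j} w_k, \qquad w_k = \log\frac{N}{n_k}$$

where N is the number of documents in the database and n_k is the number of documents containing term k.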
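The IDF-weighted similarity described above can be sketched as follows. The sketch assumes natural logarithms (the text does not state a base, and the base does not affect the ranking) and documents already reduced to sets of stemmed terms; the names `idf_weights` and `idf_similarity` and the toy collection are hypothetical.

```python
import math


def idf_weights(docs):
    """Compute log(N / n_k) for every term k in the collection."""
    n_docs = len(docs)
    doc_freq = {}
    for terms in docs.values():
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # ASSUMPTION: natural log; the source does not specify the base.
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}


def idf_similarity(doc_terms, query_terms, weights):
    """Sum the IDF weights of the terms shared by a document and a query."""
    shared = set(doc_terms) & set(query_terms)
    return sum(weights.get(term, 0.0) for term in shared)


# Toy collection of already-stemmed terms (illustrative only).
docs = {
    "d1": {"male", "fertil", "cf"},
    "d2": {"cf", "mucu"},
    "d3": {"calcium", "mucu"},
}
weights = idf_weights(docs)
query = {"male", "fertil", "cf"}
scores = {doc_id: idf_similarity(terms, query, weights)
          for doc_id, terms in docs.items()}
```

In this toy collection the term "cf" appears in two of three documents and so contributes less to a match than "male" or "fertil", each of which appears in only one.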
Once all queries were run and ranked output was calculated for both algorithms, precision and recall scores and E-measures were calculated to assess performance. The E-measure is a single-measure value that can be used to assess combined recall and precision. The formula used to calculate the E-measure is provided below.
Precision, recall, and E-measures reported in this analysis are those found at the optimal E-measures - the levels for each query where the value of E was greatest.
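The selection of an optimal cutoff in ranked output can be sketched as below. Because the formula for the combined measure is not reproduced in this text, the sketch substitutes the harmonic mean of precision and recall, a stand-in chosen only because larger values indicate better performance, in keeping with the statement that the optimal level is where the value was greatest. That choice, along with the names `precision_recall_at`, `combined_score`, and `best_cutoff`, is an assumption and may differ from the measure actually used.

```python
def precision_recall_at(ranking, relevant, cutoff):
    """Precision and recall for the top `cutoff` documents of a ranking."""
    hits = sum(1 for doc_id in ranking[:cutoff] if doc_id in relevant)
    precision = hits / cutoff
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def combined_score(precision, recall):
    """ASSUMPTION: harmonic mean of precision and recall as a stand-in
    for the combined measure, whose formula is not given in the text."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_cutoff(ranking, relevant):
    """Return (score, cutoff, precision, recall) at the best-scoring cutoff."""
    best = (0.0, 0, 0.0, 0.0)
    for cutoff in range(1, len(ranking) + 1):
        precision, recall = precision_recall_at(ranking, relevant, cutoff)
        score = combined_score(precision, recall)
        if score > best[0]:  # keep the earliest cutoff on ties
            best = (score, cutoff, precision, recall)
    return best


# Hypothetical ranked output and relevance set for one query.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
score, cutoff, precision, recall = best_cutoff(ranking, relevant)
```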
While it was predicted that the similarity measure using IDF weights would show better overall performance than the unweighted cosine measure, it was hypothesized that this performance differential might not hold to the same degree across all categories of questions. This prediction was not borne out, however. The similarity measure using IDF weights performed better across all categories, and the level of improvement was consistent across all categories. This was true at both the comprehensive (relevance = 1) and the specific (relevance = 5) retrieval expectations. These results are presented in Tables 1 and 2.
INSERT TABLES 1 AND 2 ABOUT HERE
Though the experimental results reported here reveal no differential effects of the two algorithms on the five categories of questions, two objectives have been accomplished. First, it was necessary to use a relatively simple standard such as the cosine measurement to establish a baseline against which other retrieval improvements could be measured. These results suggest that no category of questions will benefit from document retrieval that excludes term weights. Second, this experiment has demonstrated a method for assessing retrieval improvements that allows for a greater understanding of the overall results. Unfortunately, the uniformity of the improvements of one algorithm over the other in this experiment belies the advantage of this method.
Discussion
While this experiment revealed no significant insight into the relation between the structure of the question and the retrieval mechanism, there are a number of advantages to viewing question sets in terms of their categories. One advantage is that this approach provides us with insight into the user's search model. While no general model will provide detailed insight into the specific thought processes of every possible user, we can use the conceptual categories to assess more narrowly what is being attempted and what is presumed to be true on the part of the user. Certain categories, such as causation, suggest a greater degree of certainty about what is known than do other categories, such as verification. Any system that effectively maps these different models to the same search strategy is losing a good deal of the information inherent in the structure of the question.
A second advantage is that this approach allows us to see whether current retrieval systems are responding equally well to different categories of questions. The retrieval literature contains numerous overall assessments of retrieval approaches. The assumption behind this approach is that, on average, the system will perform in a manner consistent with the reported results. The disadvantage of this approach is that it may mask some of the strengths and weaknesses of each system. In looking only at overall success, a given approach may be dismissed as not worth pursuing if it does poorly compared to other approaches. It is possible, however, that the approach in fact gave superior performance in one or two categories compared with previous approaches. These qualified successes are certainly worth noting as they may be incorporated into a more comprehensive approach to retrieval problems in general. On the other hand, approaches viewed as superior may be lacking in the way they handle one or two categories of questions. If these kinds of questions are more likely to be asked by researchers in certain fields or with a certain level of expertise, the approach that works best overall may not be the approach best suited to their needs.
Finally, this approach may assist us in identifying ways of more effectively mapping the search mechanism to the user's perceived level of knowledge. A number of potential outcomes from this approach could lead to greater understanding of the retrieval process. It is possible that, for some categories of questions, the documents ranked with the highest value may not always be those most relevant to the question. One would expect this to be most true for the verification category, where the truth of the assertion inherent in the question is what is being queried. Documents which assume the truth of the assertion implicitly and which would be relevant to the query could get lower rankings based only on term-matching and document-weighting schemes. Furthermore, if some algorithms work better for some sets of questions than for others, it may be possible to create a system that allows several different algorithms to be used. A front-end natural language interface could parse the input question to determine which category it falls into. Based on the category, the most effective algorithm could be assigned to that search.
References
Fillmore, C.J. (1968). The case for case. In E. Bach & R. Harms (Eds.), Universals in linguistic theory (pp. 1-88). New York: Holt, Rinehart and Winston.
Graesser, A.C., & Murachver, T. (1985). Symbolic procedures of question answering. In A.C. Graesser & J.B. Black (Eds.), The psychology of questions (pp. 15-88). Hillsdale, NJ: Lawrence Erlbaum Associates.
Lehnert, W.G. (1978). The process of question answering: A computer simulation of cognition. Hillsdale, NJ: Lawrence Erlbaum Associates.
Liddy, E.D. (1993). An alternative representation for documents and queries. 14th National Online Meeting: proceedings, 1993, 279-284.
Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14 (3), 130-137.
Robertson, S.E. (1974). Specificity and weighted retrieval. Journal of documentation, 30, 41-46.
Salton, G., & McGill, M.J. (1983). Introduction to modern information retrieval. New York: McGraw Hill.
Vickery, B., & Vickery, A. (1992). An application of language processing for a search interface. Journal of documentation, 48, 255-275.
© 1996, American Society for Information Science. Permission to copy and
distribute this document is hereby granted provided that this copyright notice is
retained on all copies and that copies are not altered.