
Computation of terms and their weights

Each piece of free text is stoplisted and then passed through a greedy morphological algorithm that reduces the remaining words to their base forms. For example, assuming that ``books'' and ``walked'' are not stoplisted, the base form of ``books'' is ``book'' and the base form of ``walked'' is ``walk.'' For stoplisting, we use an extended version of the stoplist derived from the Brown corpus by Francis and Kucera (1982). The morphological algorithm is greedy because it accepts the first part of speech whose rules yield a base form. Given a word, it first tries to reduce the word to its base form as a noun. If the reduction succeeds, the base form is tagged as a noun. Otherwise, the reduction is attempted in turn as a verb, an adjective, and an adverb. If all rules fail, the word is tagged as a noun by default. No parsing or word sense disambiguation is done. The output of the stoplisting and the morphological analysis is a vector of unweighted a-terms. For example, the question ``What is a good first book on Scheme?'' becomes (book10, scheme10), assuming no activation is spread and ``what,'' ``is,'' ``a,'' ``good,'' ``first,'' and ``on'' are stoplisted.
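The stoplisting and greedy reduction steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the stoplist, the suffix rules, and the small lexicon used to decide whether a reduction "succeeds" are all hypothetical stand-ins, not the extended Brown-corpus stoplist or the actual morphological rules of the system.

```python
import re

# Hypothetical stoplist (assumption: stands in for the extended Brown-corpus list).
STOPLIST = {"what", "is", "a", "good", "first", "on"}

# Hypothetical suffix rules per part of speech, tried in this fixed order.
RULES = [
    ("noun", [("s", "")]),
    ("verb", [("ed", ""), ("ing", "")]),
    ("adjective", [("est", ""), ("er", "")]),
    ("adverb", [("ly", "")]),
]

# Hypothetical lexicon of base forms (assumption: a reduction succeeds
# only if it produces a word listed here for that part of speech).
LEXICON = {
    "noun": {"book", "scheme"},
    "verb": {"book", "walk"},
    "adjective": {"fast"},
    "adverb": set(),
}


def reduce_as(word, pos, rules):
    """Return the base form of word as the given part of speech, or None."""
    if word in LEXICON[pos]:
        return word
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON[pos]:
                return candidate
    return None


def base_form(word):
    """Greedy reduction: accept the first part of speech that succeeds."""
    for pos, rules in RULES:
        base = reduce_as(word, pos, rules)
        if base is not None:
            return base, pos
    return word, "noun"  # all rules failed: tag as a noun by default


def a_terms(text):
    """Stoplist the text, then reduce each remaining word to its base form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [base_form(t) for t in tokens if t not in STOPLIST]
```

Because the noun rules are tried first, ``books'' is tagged as a noun in this sketch even though the hypothetical lexicon also lists ``book'' as a verb; this is the greedy behavior described above.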

The weight of an a-term a in D, tex2html_wrap_inline620, where D is a Q&A or a client's question, is computed by

  equation169

where tex2html_wrap_inline624, tex2html_wrap_inline626, and tex2html_wrap_inline628 denote how much importance is given to each metric. tex2html_wrap_inline630 is normalized by the cosine normalization, that is, by the square root of the sum of the squares of the weights of D's a-terms. tex2html_wrap_inline624 is set to 1.0; tex2html_wrap_inline626 and tex2html_wrap_inline628 are set according to the number of vectors in the space: in smaller spaces (fewer than 40 vectors), tex2html_wrap_inline626 outweighs tex2html_wrap_inline628; in larger ones (more than 40), the reverse holds. This reflects our empirical observation that tex2html_wrap_inline644 tends to be a better discriminator than tex2html_wrap_inline646 in smaller collections.
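The cosine normalization step can be illustrated with a short Python sketch. It assumes the raw weights of D's a-terms have already been computed and are held in a dictionary; the function name and the example weights are hypothetical, and the sketch covers only the normalization, not the weighting formula or the choice of coefficients.

```python
import math


def cosine_normalize(weights):
    """Divide each raw weight by the square root of the sum of squared weights.

    weights: dict mapping an a-term to its raw weight in D (hypothetical input).
    """
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)  # nothing to scale in an all-zero vector
    return {term: w / norm for term, w in weights.items()}


# Made-up raw weights, chosen so the norm is exactly 5.0.
normalized = cosine_normalize({"book": 3.0, "scheme": 4.0})
```

After normalization the squared weights of a nonzero vector sum to 1, so vectors built from long and short texts become comparable in length.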


Val Kulyukin
Thu Mar 19 09:57:35 CST 1998