Each piece of free text is stoplisted and then passed through a greedy morphological algorithm that reduces the non-stoplisted words to their base forms. For example, assuming that ``books'' and ``walked'' are not stoplisted, the base form of ``books'' is ``book'' and the base form of ``walked'' is ``walk.'' For stoplisting, we use an extended version of the stoplist derived from the Brown corpus by Francis and Kucera (1982). The morphological algorithm is greedy because it takes the first part of speech whose rules allow it to obtain a base form. Given a word, the algorithm first tries to reduce it to its base form as a noun. If the reduction succeeds, the base form is tagged as a noun. Otherwise, the reduction is attempted, in order, as a verb, an adjective, and an adverb. If all rules fail, the word is tagged as a noun by default. No parsing or word sense disambiguation is done. The output of the stoplisting and the morphological analysis is a vector of unweighted a-terms. For example, the question ``What is a good first book on Scheme?'' becomes (book10, scheme10), assuming no activation is spread and ``what,'' ``is,'' ``a,'' ``good,'' ``first,'' and ``on'' are stoplisted.
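The stoplisting and greedy reduction described above can be sketched as follows. This is a minimal illustration and not the system's actual implementation: the toy stoplist, the tiny per-part-of-speech lexicon, the suffix rules, and the names `reduce_as`, `base_form`, and `analyze` are all assumptions made for the example. In particular, the sketch assumes that a reduction ``succeeds'' when stripping a suffix yields a form found in the lexicon for that part of speech.

```python
import re

# Toy stoplist (assumption): only the words stoplisted in the example question.
STOPLIST = frozenset({"what", "is", "a", "good", "first", "on"})

# Hypothetical lexicon of known base forms for each part of speech.
LEXICON = {
    "noun": {"book", "scheme"},
    "verb": {"book", "walk"},
    "adjective": {"quick"},
    "adverb": {"quickly"},
}

# Hypothetical suffix rules: (suffix to strip, replacement).
# The empty suffix covers words that are already in base form.
SUFFIX_RULES = {
    "noun": [("", ""), ("s", "")],
    "verb": [("", ""), ("ed", ""), ("s", "")],
    "adjective": [("", ""), ("er", ""), ("est", "")],
    "adverb": [("", ""), ("ly", "")],
}

# Parts of speech are tried in this fixed order; the first success wins.
POS_ORDER = ("noun", "verb", "adjective", "adverb")


def reduce_as(word, pos):
    """Return the base form of `word` under the rules for `pos`, or None."""
    for suffix, replacement in SUFFIX_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON[pos]:
                return candidate
    return None


def base_form(word):
    """Greedy reduction: take the first part of speech whose rules succeed."""
    for pos in POS_ORDER:
        base = reduce_as(word, pos)
        if base is not None:
            return base, pos
    # All rules failed: tag the word as a noun by default.
    return word, "noun"


def analyze(text):
    """Stoplist the text, then reduce each remaining word to its base form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [base_form(t) for t in tokens if t not in STOPLIST]
```

In this sketch, ``books'' is tagged as a noun even though ``book'' is also listed as a verb, because the noun rules are tried first; that is the greedy choice the paragraph describes. Calling `analyze("What is a good first book on Scheme?")` keeps only ``book'' and ``scheme,'' both tagged as nouns.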
The weight of an a-term a in D, [...], where D is a Q&A or a client's question, is computed by

[...]

where [...], [...], and [...] denote how much importance is given to each metric. [...] is normalized by the cosine normalization: the square root of the sum of the squares of the weights of D's a-terms; [...] is set to 1.0; [...] and [...] are set according to the number of vectors in the space: in smaller spaces (fewer than 40 vectors), [...] outweighs [...]; in larger ones (more than 40), vice versa. This reflects our empirical observation that [...] tends to be a better discriminator than [...] in smaller collections.
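The cosine normalization named above can be sketched as follows. This illustrates only the normalization step, applied to a hypothetical dictionary of raw a-term weights for a single D; the function name `cosine_normalize`, the dictionary representation, and the handling of an all-zero vector are assumptions, not details taken from the system.

```python
import math


def cosine_normalize(weights):
    """Divide each a-term weight by the Euclidean norm of D's weight vector.

    `weights` maps each a-term of D to its raw weight (hypothetical layout).
    """
    # Square root of the sum of the squares of the weights of D's a-terms.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        # Assumption: leave an all-zero vector unchanged to avoid dividing by zero.
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}
```

For example, raw weights of 3.0 for ``book'' and 4.0 for ``scheme'' have a norm of 5.0, so the normalized weights are 0.6 and 0.8, and the normalized vector has unit length.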