
Metrics of the vector space model

To compute term weights, two of the term weight metrics of the vector space model are used. The first metric is commonly referred to as tfidf (Salton and McGill, 1983). Let tf be the term frequency (the number of times a term a appears in a Q&A D), N be the total number of Q&A's, and $n_a$ the number of Q&A's containing a. Then a's tfidf in D, $w_{tfidf}(a, D)$, is $tf \cdot \log(N / n_a)$.
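As a rough illustration of the tfidf metric, the following Python sketch weights every term in a small collection. It assumes the standard form tf * log(N / n), a natural logarithm, and Q&A's that are already tokenized; the function name `tfidf_weights` and the toy data are hypothetical and not taken from the system described here.

```python
import math
from collections import Counter

def tfidf_weights(qa_tokens):
    """Return one {term: weight} dict per Q&A, using tf * log(N / n).

    qa_tokens is a list of token lists, one per Q&A (assumed pre-tokenized).
    """
    n_total = len(qa_tokens)              # N: total number of Q&A's
    doc_freq = Counter()                  # n: number of Q&A's containing each term
    for tokens in qa_tokens:
        doc_freq.update(set(tokens))      # count each term once per Q&A

    weights = []
    for tokens in qa_tokens:
        tf = Counter(tokens)              # tf: occurrences of each term in this Q&A
        weights.append({
            term: count * math.log(n_total / doc_freq[term])
            for term, count in tf.items()
        })
    return weights

# Toy collection of three Q&A's (illustrative data only).
qas = [
    ["reset", "password", "password"],
    ["reset", "printer"],
    ["reset", "modem"],
]
w = tfidf_weights(qas)
```

In this toy collection "reset" occurs in every Q&A, so its weight is zero everywhere, while "password" occurs twice in a single Q&A and receives the largest weight.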

The second metric is condensation clustering (Bookstein, Klein, & Raita, 1998). Unlike with tfidf, indexing terms are valued on the basis of their patterns of occurrence in a sequence of Q&A's. The terms that do not bear content appear to be distributed randomly over the Q&A's, while deviations from randomness indicate content.

Condensation clustering (CC) of a is the ratio of the actual number of Q&A's containing at least one occurrence of a to the expected number of such Q&A's, assuming a random distribution. Let U be the total number of Q&A's in the collection. Let a random variable $X_i$ be 1 if the i-th Q&A contains a and 0 otherwise. Let T be the number of occurrences of a in all Q&A's. The number of units containing a is $\sum_{i=1}^{U} X_i$, with expected value $E(\sum_{i=1}^{U} X_i) = \sum_{i=1}^{U} E(X_i)$. Since each $X_i$ is binary, its expected value is the probability that $X_i = 1$, which is $1 - (1 - 1/U)^T$. Hence, $E(\sum_{i=1}^{U} X_i) = U (1 - (1 - 1/U)^T)$. If N is the actual number of Q&A's containing a (here N no longer denotes the collection size, which is U), a's CC is $CC(a) = N / (U (1 - (1 - 1/U)^T))$.

Since the terms for which $CC(a) \geq 1$ are unlikely to have much content, we modify the CC measure, $CC'(a)$, to be [formula missing in source] if $CC(a) < 1$, and 0 otherwise. $CC'(a)$ indicates whether a bears content in a collection of Q&A's as a whole. But when a's weight is computed with respect to D, we need to take into account how important a is in D. This is done by multiplying a's CC by its frequency, tf, in D. Thus, a's weight in D, $w_{CC}(a, D)$, becomes

  $$ w_{CC}(a, D) = tf \cdot CC'(a) $$
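The expected-count argument can be sketched in Python as follows. The sketch assumes the usual occupancy model, in which each of the T occurrences falls independently and uniformly into one of the U Q&A's, so that a given Q&A contains the term with probability 1 - (1 - 1/U)^T. It computes only the raw CC ratio; the modified measure and the final weight are left out. The function names are hypothetical.

```python
def expected_containing(total_occurrences, num_qas):
    """Expected number of Q&A's containing a term under random placement.

    total_occurrences is T and num_qas is U; the occupancy model is an assumption.
    """
    p_contains = 1.0 - (1.0 - 1.0 / num_qas) ** total_occurrences
    return num_qas * p_contains

def condensation_clustering(actual_containing, total_occurrences, num_qas):
    """Raw CC: actual number of Q&A's containing the term over the expected number."""
    if total_occurrences < 1:
        raise ValueError("the term must occur at least once in the collection")
    return actual_containing / expected_containing(total_occurrences, num_qas)

# Two occurrences in a collection of four Q&A's: 1.75 Q&A's expected to contain the term.
expected = expected_containing(2, 4)

# Both occurrences in one Q&A (clumped) versus one occurrence in each of two Q&A's.
cc_clumped = condensation_clustering(1, 2, 4)
cc_spread = condensation_clustering(2, 2, 4)
```

A clumped term gives a ratio below 1, which is the case the text treats as content-bearing, while a term spread as widely as possible gives a ratio above 1.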



Val Kulyukin
Thu Mar 19 09:57:35 CST 1998