To compute term weights, two of the term weight metrics
of the vector space model are used. The first metric is
commonly referred to as tfidf (Salton and McGill, 1983).
Let tf be the term frequency (the number of times a term
a appears in a Q&A D), N be the total number of Q&A's,
and $n$ the number of Q&A's containing a. Then
a's tfidf in D, $tfidf(a, D)$, is $tf \cdot \log(N/n)$.
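As a rough illustration of the tfidf weight, the sketch below computes it for a term in one Q&A of a toy collection. The function name `tfidf`, the whitespace-style token lists, the sample Q&A's, and the use of the natural logarithm are all assumptions made for the example; the source fixes none of them.

```python
import math


def tfidf(term, qa_tokens, collection):
    """Return tf * log(N / n) for `term` in one tokenized Q&A.

    Assumptions for this sketch: a Q&A is a list of tokens, and the
    logarithm is natural (the base is not specified in the text).
    """
    tf = qa_tokens.count(term)                              # occurrences of the term in this Q&A
    N = len(collection)                                     # total number of Q&A's
    n = sum(1 for tokens in collection if term in tokens)   # Q&A's containing the term
    if tf == 0 or n == 0:
        return 0.0                                          # term absent: weight is zero
    return tf * math.log(N / n)


# Hypothetical toy collection of four tokenized Q&A's.
qas = [
    ["how", "do", "i", "reset", "my", "password"],
    ["password", "reset", "requires", "a", "valid", "email"],
    ["how", "do", "i", "contact", "support"],
    ["where", "is", "the", "billing", "page"],
]

print(tfidf("password", qas[0], qas))  # term found in 2 of 4 Q&A's
print(tfidf("billing", qas[3], qas))   # term found in 1 of 4 Q&A's
```

A term that occurs in fewer Q&A's receives a larger weight for the same tf, which is the behaviour the metric is meant to capture.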
The second metric is condensation clustering (Bookstein, Klein, & Raita, 1998). Unlike with tfidf, indexing terms are valued on the basis of their patterns of occurrence in a sequence of Q&A's. The terms that do not bear content appear to be distributed randomly over the Q&A's, while deviations from randomness indicate content.
Condensation clustering (CC) of a is the ratio of the actual
number of Q&A's containing at least one occurrence of a
to the expected number of such Q&A's, assuming a random
distribution. Let U be the total number of Q&A's in the
collection. Let the random variable $X_i$ be 1 if the
i-th Q&A contains a and 0 otherwise. Let T be the
number of occurrences of a in all Q&A's. The number of
units containing a is $\tilde{N} = \sum_{i=1}^{U} X_i$, with expected
value $E(\tilde{N}) = \sum_{i=1}^{U} E(X_i)$.
Since each $X_i$ is binary, its expected value is
the probability that $X_i = 1$, which is $p = 1 - (1 - 1/U)^T$.
Hence, $E(\tilde{N}) = Up$. If N is the actual
number of Q&A's containing a, a's CC is $CC(a) = N / E(\tilde{N})$
. Since
the terms for which $CC(a) \geq 1$ are unlikely to have
much content, we modify the CC measure, $CC(a)$, to a
measure $CC'(a)$ that is nonzero if $CC(a) < 1$, and 0 otherwise.
The modified measure, $CC'(a)$, indicates whether a bears content in a collection
of Q&A's as a whole. However, when a's weight is computed
with respect to D, we need to take into account how
important a is in D. This is done by multiplying a's
modified CC by its frequency, tf, in D. Thus, a's weight in
D, $w(a, D)$, becomes $tf \cdot CC'(a)$.
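The CC ratio described above can be checked numerically with a short sketch. It assumes each Q&A is a list of tokens and that, under the random-distribution assumption, each of the T occurrences falls in any given Q&A with probability 1/U. The names `expected_containing` and `condensation_clustering` and the sample collection are hypothetical. The sketch stops at the raw ratio and does not implement the modified measure or the final weight.

```python
def expected_containing(U, T):
    """Expected number of Q&A's with at least one occurrence,
    when T occurrences are spread at random over U Q&A's."""
    p = 1.0 - (1.0 - 1.0 / U) ** T   # probability that a given Q&A contains the term
    return U * p


def condensation_clustering(term, collection):
    """Raw CC: actual count of Q&A's containing the term over the expected count.

    Returns None when the term never occurs (the ratio is undefined).
    """
    U = len(collection)                                       # total number of Q&A's
    T = sum(tokens.count(term) for tokens in collection)      # occurrences in all Q&A's
    N = sum(1 for tokens in collection if term in tokens)     # Q&A's containing the term
    if T == 0:
        return None
    return N / expected_containing(U, T)


# Hypothetical toy collection: "password" is clumped in one Q&A,
# while "how" is spread over three Q&A's.
qas = [
    ["password", "reset", "password", "expired", "password"],
    ["how", "do", "i", "contact", "support"],
    ["how", "do", "i", "update", "billing"],
    ["how", "is", "billing", "handled"],
]

print(round(condensation_clustering("password", qas), 3))  # 0.432
print(round(condensation_clustering("how", qas), 3))       # 1.297
```

Both terms occur three times, but the clumped term yields a ratio below 1 and the evenly spread term a ratio above 1, matching the intuition that deviation from randomness signals content.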