ASIST 2012 Annual Meeting
Baltimore, MD, October 26-30, 2012
Least Information Document Representation for Automated Text Classification
We proposed the Least Information theory (LIT) to quantify the meaning of information in probability distributions and derived from it a new document representation model for text classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, LIT offers an information-centric approach to weighting terms based on their probability distributions in individual documents versus in the collection. We developed two term weight quantities in the document classification context: 1) LI Binary (LIB), which quantifies the (least) information due to observing a term's (binary) occurrence in a document; and 2) LI Frequency (LIF), which measures the information gained from observing a randomly picked term from the document. Both quantities are computed with term distributions in the document collection as prior knowledge, and they can be used separately or combined to represent documents for text classification. In classification experiments on three benchmark collections, the proposed methods performed strongly compared to the classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, outperformed TF*IDF in several experimental settings. Despite its similarity to TF*IDF, the formulation of LIB*LIF is very different and offers a new way of thinking about modeling information processes beyond classification.
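For reference, the classic TF*IDF baseline against which LIB and LIF are compared can be sketched as follows. This is a minimal illustration of the standard weighting only; the LIB and LIF formulas are derived in the full paper and are not reproduced here, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute TF*IDF weights for each term in each document.

    TF is the raw term count within a document; IDF is
    log(N / df_t), where N is the number of documents and
    df_t is the number of documents containing term t.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# Toy collection (hypothetical example)
docs = [
    ["least", "information", "theory"],
    ["information", "classification"],
    ["text", "classification", "classification"],
]
w = tfidf_weights(docs)
# "information" occurs in 2 of 3 documents, so its IDF is log(3/2)
```

TF*IDF weights a term by within-document frequency discounted by collection-wide commonness; LIB and LIF instead derive weights from the least information needed to account for the shift between a term's collection-level distribution and its observed occurrence in the document.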