Identification of Effective Predictive Variables for Document Qualities
Kwong Bor Ng (Queens College, CUNY), Paul Kantor (Rutgers), Rong Tang (SUNY Albany), Robert Rittman (Rutgers), Sharon Small (SUNY Albany), Peng Song (Rutgers), Tomek Strzalkowski (SUNY Albany), Ying Sun (Rutgers), and Nina Wacholder (Rutgers)
We analyzed textual properties of documents, using statistical and linguistic methods, to identify variables that predict various document qualities. We created a collection of 1000 documents, each of which was judged on nine document qualities (accuracy, reliability, objectivity, depth, author/producer credibility, readability, verbosity and conciseness, grammatical correctness, and one-sided or multi-view). Using statistical analyses, we considered linear combinations of textual features and asked (1) whether textual features can be combined linearly to predict document qualities; (2) which textual features have good predictive power; and (3) which textual features are minimally required for prediction with a detection rate much better than the false alarm rate. We present several promising results, indicating that with a small number of textual features we can predict various document qualities much better than chance.
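To make the evaluation criterion concrete, the following is a minimal sketch of how a linear combination of textual features might be scored against a detection rate and a false alarm rate. It is an illustration under stated assumptions, not the procedure used in the study: the two features, the weights, the threshold, and the six toy documents are all hypothetical, and the function names are introduced here only for the example.

```python
from typing import List, Sequence, Tuple


def linear_score(features: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Combine textual features linearly: bias + sum of weight * feature."""
    return bias + sum(w * x for w, x in zip(weights, features))


def detection_and_false_alarm(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (detection rate, false alarm rate) at a given score threshold.

    Detection rate: fraction of documents that have the quality and are flagged.
    False alarm rate: fraction of documents that lack the quality but are flagged.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    hits = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    false_alarms = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    return hits / positives, false_alarms / negatives


# Hypothetical toy data: two made-up features per document and a made-up
# judgment of whether the document has some quality (e.g. "depth").
toy_features: List[List[float]] = [
    [1.0, 0.2], [0.9, 0.1], [0.3, 0.3],  # judged to have the quality
    [0.2, 0.8], [0.4, 0.5], [0.8, 0.9],  # judged not to have it
]
toy_labels = [True, True, True, False, False, False]

# Hypothetical weights and threshold, chosen by hand for illustration only.
weights = [1.0, -1.0]
threshold = 0.5

scores = [linear_score(f, weights) for f in toy_features]
detection_rate, false_alarm_rate = detection_and_false_alarm(scores, toy_labels, threshold)
print(f"detection rate = {detection_rate:.2f}, false alarm rate = {false_alarm_rate:.2f}")
```

In a sketch like this, a feature set would count as useful when the detection rate is well above the false alarm rate at some threshold; in practice the weights would be fitted to the judged documents rather than set by hand.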