2013 Annual Meeting
Montréal, Québec, Canada | November 1-5, 2013
Andrew Yates, Georgetown University
Nazli Goharian, Georgetown University
Wai Gen Yee, Orbitz Worldwide
Document-level sentiment analysis, the task of determining whether the sentiment expressed in a document is positive or negative, is commonly performed with supervised methods. As with all supervised tasks, obtaining training data for these methods can be expensive and time-consuming. Some semi-supervised approaches have been proposed, but they rely on sentiment lexicons. We propose a novel supervised method and a novel semi-supervised method for sentiment analysis; both are based on a probabilistic graphical model, and neither requires a lexicon. Our semi-supervised method takes advantage of the numerical ratings that are often included in online reviews (e.g., 4 out of 5 stars). Although these ratings are related to sentiment, they are noisy, so on their own they are an imperfect indicator of a review's sentiment. We incorporate unlabeled user reviews as training data by treating each review's numerical rating as a sentiment label while explicitly modeling the rating's noisy nature. Our empirical results, obtained on a corpus of labeled sentences from hotel reviews together with unlabeled hotel reviews that carry numerical ratings, show that treating ratings as noisy labels and using them to augment a small number of labeled sentences outperforms strong existing supervised and semi-supervised approaches, both classification-based and lexicon-based.
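
The idea of treating a rating as a noisy sentiment label can be illustrated with a minimal Python sketch. This is not the probabilistic graphical model proposed in the paper, whose structure the abstract does not specify; it is a generic two-class Bayes update under stated assumptions. The function names, the 4-star threshold, and the `flip_prob` noise parameter are all hypothetical choices made for illustration.

```python
def rating_to_noisy_label(rating, threshold=4):
    """Map a star rating to a noisy sentiment label.

    Assumption for illustration: ratings at or above `threshold`
    (default 4 on a 5-star scale) are labeled positive.
    """
    return rating >= threshold


def posterior_positive(p_text_pos, rating_label_pos, flip_prob):
    """Return P(true sentiment is positive | text, noisy rating label).

    p_text_pos       -- a text classifier's probability that the review is positive
    rating_label_pos -- the noisy label derived from the rating
    flip_prob        -- assumed probability that the rating label disagrees
                        with the true sentiment (hypothetical noise model)
    """
    if not 0.0 < flip_prob < 1.0:
        raise ValueError("flip_prob must be strictly between 0 and 1")

    # Likelihood of the observed rating label under each true sentiment.
    like_pos = (1.0 - flip_prob) if rating_label_pos else flip_prob
    like_neg = flip_prob if rating_label_pos else (1.0 - flip_prob)

    # Bayes' rule over the two sentiment classes.
    numerator = like_pos * p_text_pos
    denominator = numerator + like_neg * (1.0 - p_text_pos)
    return numerator / denominator


if __name__ == "__main__":
    label = rating_to_noisy_label(4)
    # The text evidence is uninformative here, so the rating dominates.
    print(posterior_positive(0.5, label, flip_prob=0.2))
    # Strong negative text evidence outweighs the positive rating.
    print(posterior_positive(0.1, label, flip_prob=0.2))
```

In this sketch, a rating never fully determines the label: as `flip_prob` approaches 0.5 the rating carries no information, and as it approaches 0 the rating is trusted almost completely.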