Technology Insights 7.2 Large Textual Data Sets for Predictive
Text Mining and Sentiment Analysis
Congressional Floor-Debate Transcripts: Published by Thomas et al. (Thomas,
Pang, and Lee, 2006); contains political speeches that are labeled to indicate whether the
speaker supported or opposed the legislation discussed.
Economining: Published by Stern School at New York University; consists of feedback
postings for merchants at Amazon.com.
Cornell Movie-Review Data Sets: Introduced by Pang and Lee (Pang and Lee,
2008); contains 1,000 positive and 1,000 negative reviews with automatically derived
document-level labels, and 5,331 positive and 5,331 negative sentences/snippets.
Stanford Large Movie Review Data Set: A set of 25,000 highly polar movie
reviews for training and 25,000 for testing. Additional unlabeled data is also provided.
Both raw text and preprocessed bag-of-words formats are available.
(See: http://ai.stanford.edu/~amaas/data/sentiment.)
MPQA Corpus: The Multi-Perspective Question Answering (MPQA) opinion corpus; contains
535 manually annotated news articles from a variety of news sources, labeled for opinions
and private states (beliefs, emotions, speculations, etc.).
Multiple-Aspect Restaurant Reviews: Introduced by Snyder and Barzilay (Snyder
and Barzilay, 2007); contains 4,488 reviews with an explicit 1-to-5 rating