Technology Insights 7.2 Large Textual Data Sets for Predictive Text Mining and Sentiment Analysis

Congressional Floor-Debate Transcripts: Published by Thomas et al. (Thomas and B. Pang, 2006); contains political speeches that are labeled to indicate whether the speaker supported or opposed the legislation discussed.

Economining: Published by the Stern School at New York University; consists of feedback postings for merchants at Amazon.com.

Cornell Movie-Review Data Sets: Introduced by Pang and Lee (Pang and Lee, 2008); contains 1,000 positive and 1,000 negative automatically derived document-level labels, and 5,331 positive and 5,331 negative sentences/snippets.

Stanford Large Movie Review Data Set: A set of 25,000 highly polar movie reviews for training and 25,000 for testing. Additional unlabeled data is also available. Raw text and already processed bag-of-words formats are provided. (See: http://ai.stanford.edu/~amaas/data/sentiment.)
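
To make the "bag-of-words" format mentioned above concrete, the following is a minimal Python sketch of how raw review text can be turned into word-count vectors. It does not read the actual data set files or reproduce their layout; the two toy reviews, the tokenizer pattern, and the helper names bag_of_words and to_vector are assumptions made purely for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


# Hypothetical toy reviews; these are NOT taken from the actual data set.
reviews = [
    ("A great film with a great cast.", "pos"),
    ("Dull plot and a dull script.", "neg"),
]

# Shared vocabulary, sorted so that the column order is reproducible.
vocabulary = sorted({tok for text, _ in reviews for tok in bag_of_words(text)})


def to_vector(text, vocabulary):
    """Map a text to a list of counts, one entry per vocabulary word."""
    counts = bag_of_words(text)
    return [counts[tok] for tok in vocabulary]


vectors = [to_vector(text, vocabulary) for text, _ in reviews]

for (text, label), vec in zip(reviews, vectors):
    print(label, vec)
```

Each review becomes a fixed-length vector of counts over the shared vocabulary; word order is discarded, which is what distinguishes a bag-of-words representation from the raw text.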

MPQA Corpus: Corpus and Opinion Recognition System corpus; contains 535 manually annotated news articles from a variety of news sources, with labels for opinions and private states (beliefs, emotions, speculations, etc.).

Multiple-Aspect Restaurant Reviews: Introduced by Snyder and Barzilay (Snyder and Barzilay, 2007); contains 4,488 reviews with an explicit 1-to-5 rating