Project 2

mezz

College of Computing and Informatics

Text Classification

Create an IPython notebook that answers the following questions. Plot every diagram in the notebook and copy it into the report for analysis. The report should also include descriptions, discussions, etc.

Dataset:

Yelp review dataset.

Exploratory Data Analysis

· Which three businesses have the most five-star ratings (0.37)? Plot the counts of positive (4-5) and negative (1-2) reviews for each of these businesses (0.37). (0.75 mark)

· Do positive reviews (ratings 4-5) tend to receive more cool, useful and funny votes than negative reviews (ratings 1-2)? (0.75 mark)
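
The two questions above can be approached as in the following minimal sketch. It uses a handful of made-up rows in place of the real data, and the field names (business_id, stars, cool, useful, funny) are assumptions about how the Yelp review file is laid out, so adjust them to the actual columns. In the notebook the same counts would typically be computed with a data-frame library and passed to a bar chart.

```python
from collections import Counter
from statistics import mean

# Toy rows standing in for the Yelp reviews (values are invented).
# Assumed fields: business_id, stars, cool, useful, funny.
FIELDS = ("business_id", "stars", "cool", "useful", "funny")
rows = [
    ("A", 5, 2, 3, 0), ("A", 5, 1, 2, 1), ("A", 5, 0, 1, 0), ("A", 1, 0, 4, 2),
    ("B", 5, 1, 1, 0), ("B", 5, 3, 2, 1), ("B", 4, 0, 0, 0), ("B", 2, 0, 1, 1),
    ("C", 5, 1, 1, 1), ("C", 1, 0, 2, 0), ("C", 2, 1, 0, 0),
    ("D", 4, 2, 2, 0), ("D", 3, 0, 1, 0),
]
reviews = [dict(zip(FIELDS, row)) for row in rows]


def sentiment(stars):
    """Map a star rating to positive (4-5), negative (1-2) or neutral (3)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


# Count five-star reviews per business and keep the three largest counts.
five_star = Counter(r["business_id"] for r in reviews if r["stars"] == 5)
top3 = [business for business, _ in five_star.most_common(3)]

# Positive and negative review counts for each of the top three businesses.
counts = {business: Counter() for business in top3}
for r in reviews:
    if r["business_id"] in counts:
        counts[r["business_id"]][sentiment(r["stars"])] += 1

# Average number of cool/useful/funny votes per review, by sentiment.
mean_votes = {
    label: {
        vote: mean(r[vote] for r in reviews if sentiment(r["stars"]) == label)
        for vote in ("cool", "useful", "funny")
    }
    for label in ("positive", "negative")
}

print(top3)
print(counts)
print(mean_votes)
```

Comparing means is only a first look. Vote counts are usually skewed, so medians or a distribution plot per group may give a fairer picture in the report.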

Data cleaning and preprocessing

· Clean the review texts as you see fit and justify your decisions (1 mark for cleaning and 1 for the justifications). For example, if you decide not to remove emoticons, explain why. The data cleaning process must be comprehensive. (2 marks)

· Create three word clouds: one for all reviews, one for positive reviews and one for negative reviews (0.16 each). (0.5 mark)

· Use a vector space model to represent the reviews (0.5) and report the ten most frequent words in the training set (0.5). (1 mark)
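
As an illustration of the cleaning and vector space steps above, here is a small sketch under stated assumptions: the cleaning rules (lowercasing, removing URLs, keeping letters only, dropping a tiny hand-written stopword list) are example choices rather than required ones, and the three review texts are invented. It builds a plain bag-of-words count matrix and ranks words by their total count in the training texts.

```python
import re

# A deliberately tiny stopword list for illustration; a real run would
# use a fuller list and justify it in the report.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of", "i", "for"}


def clean(text):
    """Lowercase, remove URLs and non-letter characters, drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace
    return [tok for tok in text.split() if tok not in STOPWORDS]


# Invented training texts standing in for the review column.
train_texts = [
    "The food was GREAT, and the service was great too!",
    "Terrible service... I waited 40 minutes.",
    "Great food! See http://example.com for the menu.",
]
docs = [clean(text) for text in train_texts]

# Vocabulary: one vector dimension per distinct word in the training set.
vocab = sorted({tok for doc in docs for tok in doc})
index = {word: i for i, word in enumerate(vocab)}


def vectorize(tokens):
    """Turn a token list into a term-count vector over the vocabulary."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in index:  # words unseen in training are ignored
            vec[index[tok]] += 1
    return vec


matrix = [vectorize(doc) for doc in docs]

# Total count of each word = column sum of the document-term matrix.
totals = {word: sum(row[i] for row in matrix) for word, i in index.items()}
top_words = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:10]

print(top_words)
```

Note that the letters-only rule also splits contractions ("don't" becomes "don" and "t") and discards emoticons and digits, which is exactly the kind of decision the report is asked to justify. TF-IDF weighting is a common alternative to raw counts for the vector representation.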

Model development

· Convert the ratings into positive (4-5), neutral (3) and negative (1-2) classes. Develop and compare at least three text classifiers that predict the sentiment of a review from its text (1 mark). You are expected to tune the hyperparameters and choose the best combination (1 mark). (2 marks)

· Choose the best-performing model and analyze its results (1 mark). Is the difference between its results and those of the worst-performing classifier statistically significant (1 mark)? (2 marks)

· Novelty and creativity (1 mark)
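
For the statistical significance question in the model development section, one possible approach is McNemar's test on the paired predictions of the two classifiers over the same test set. The brief does not prescribe a test, so this choice is an assumption, as are the labels and predictions below, which are invented. The sketch implements the exact (binomial) form of the test using only the standard library.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test for two classifiers on the same examples.

    Returns (b, c, p_value), where b counts examples that A gets right and
    B gets wrong, and c counts examples that A gets wrong and B gets right.
    """
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the classifiers never disagree on correctness
    # Under the null hypothesis each disagreement is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Invented test labels and predictions (pos / neu / neg sentiment classes).
y_true = ["pos", "neg", "neu", "pos", "neg", "pos", "neg", "pos", "neu", "neg"]
best = list(y_true)  # hypothetical best model: all correct
worst = ["neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos", "neu", "neg"]

b, c, p_value = mcnemar_exact(y_true, best, worst)
print(b, c, p_value)
```

The test looks only at the examples where exactly one of the two classifiers is correct, so it needs both models to be evaluated on an identical test set. A paired test over cross-validation fold scores is another option that could be discussed in the report.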