Predicting Yelp User Rating using Spark ML library

sandy9889
Assignment5-sparkML-FA183.pdf

Assignment 5---Predicting Yelp User Rating using Spark ML library (30

points)

For this assignment, you are to write a spark program to predict Yelp user rating based on their review

Text.

You can download the data files from here:

https://uofi.box.com/s/b3hwy8rax1eipmiz2r78csxp06jnodci . Unzip the folder and find review.json

and user.json . You can read the description of each file and their attributes here:

https://www.yelp.com/dataset/documentation/main

What you need to do: 1- Data Exploration:

 load the review.json file and extract “text” and “stars” attributes

 find the distribution of “stars” attributes; that is, find the number of reviews for

each star value.

2- Feature Engineering:

 The star ratings 1,2 and 3typically indicate dissatisfaction and the star rating 4,5

shows satisfaction. Create a new column “rating” with values 0 (if the star rating

is 1,2, or 3) and 1 (if the star rating is 4,5) . This will be the target variable you

want to predict.

 Find the distribution of the “rating” column; that is, find the count of reviews for

each rating=0 and rating=1. Is the rating attribute balanced? If not, you should

down sample your data. That means, keep the rating value with the lowest

count but take a sample of the reviews for the other category in order to have a

balanced distribution between both classes. This is called stratified sampling and

you can accomplish this in spark using the “sampleBy” method of dataframe.

For an example of stratified sampling you can see here:

http://allaboutscala.com/big-data/spark/#dataframe-statistics-sampleby and

here: https://mapr.com/blog/churn-prediction-pyspark-using-mllib-and-ml-

packages/ (the section on stratified sampling)

 Unfortunately, the dataset is still too big for our tiny cluster and running ML

models on it can take a long time with our limited resources. So when doing

down-sampling, using sampleBy method, multiply all fractions by 0.1 to get a

sample of only 10% of reviews in each rating category after down-sampling.

Below are the counts I get after stratified sampling (Depending on the seed you

give to samplyBy method, you might get different counts but if you set the seed

to 111, you should get the same distribution as mine)

 Extract TFIDF vectors from the review Text. When creating countVectorizer, use

setMinDF(100) to only include words in the feature vector that appear in at

least 100 reviews. Make sure that you remove stop words and punctuations and

use stemming as explained in the labs.

3- Building Machine Learning pipelines.

 Use three different machine learning models (Logistic Regression, Random

Forest, and Gradient Boosted Classification Trees) to predict the ratings based

on the TFIDF vector of the review text. For an example of a Gradient Boosteed

Classification Trees please refer to: https://spark.apache.org/docs/2.2.0/ml-

classification-regression.html#gradient-boosted-tree-classifier

 Use CrossValidation with three folds and Area Under Curve (AUC) metric to

evaluate and tune each model’s hyper-parameter ( please refer to the labs

posted for this module for examples of cross validation in spark).

Create a separate pipeline for each ML model. Split the data to testing and

training sets, fit each model on the training data, get the predictions for the test

data, and print the AUC for each model.

Explain which model did a better job on predicting the ratings in the test set?

4- Making an ensemble of the above three models:

 Typically, an ensemble of multiple models works better than a single model. In

this step, you take the predictions generated by the three models above (that is,

logistic regression and random forest, and gradient boosted classification tree) ,

zip them together and compute a “prediction_ensembled” column which is

basically a majority vote of the three prediction columns generated by each

model. That is, if two or more of the models generated the same prediction,

then use that prediction in the prediction_ensembled column; otherwise, if

none of the predictions are the same, then use the rating value1 for the

prediction_ensemble column. (You can accomplish this in spark sql using a

simple case when query, see an example here:

https://stackoverflow.com/questions/25157451/spark-sql-case-when-then )

 Build a dataframe with two columns “prediction_ensemble” and “rating”. Call

this dataframe “ensemble”. Use the following code segment to get the Area

under ROC curve of this ensemble model. Explain whether the Area under ROC

curve of the ensemble model is higher/lower or roughly the same as each of the

individual models?

//Importing the required libraries

import spark.implicits._

import org.apache.spark.mllib.evaluation._

//Converting the ensemble dataframe to an rdd of the form (prediction_ense

mble, rating)

val predictionsAndLabels=ensemble.selectExpr("cast(prediction_ensemble as

Double) prediction_ensemble", "cast(rating as Double) rating").rdd.map(row

=>(row.getAs[Double]("prediction_ensemble"),row.getAs[Double]("rating")))

//Compute the accuracy of the ensemble predictions

val metrics= new BinaryClassificationMetrics(predictionsAndLabels)

println(metrics.areaUnderROC)

5- Adding more Features (optional +3 pts)

 Intuitively we know that some users might always give skeptical reviews while others

might always give good reviews. That suggests that adding the average star ratings given

by a user may have some correlations with the target variable, rating. The user.json file

has an attribute average_star which shows the average star ratings for each user id. You

can join this dataset with your rating dataframe to get the average_stars for the user

who gave each rating value.

Now find the correlation between average_stars and rating columns. You can use

stat.corr(col1,col2) method of the dataframe to get correlation between two numeric

columns in the dataframe. You can find an example of the stat.corr method here:

https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-

dataframes-in-spark.html .

If the correlation value is significantly above zero then it might suggest a possible

positive correlation between the two columns.

 Use vector assembler to make a feature vectore of average_star and the TFIDF vector of

review text. Add this assembler to each ML pipeline you created before and rerun each

pipeline. Explain if adding the average_star to the feature vector increased or decreased

each model’s performance.

Few Notes on Running your code: 1. Be patient! Each model might take a long time to run ( 30mins-1 hour) .

2. You might get a few warnings while running your ml pipeline. That would be fine. You can ignore

warnings

3. If you get an error regarding a yarn container used maximum allowed physical memory, Then

you can change the settings in your yarn-site.xml as follows on all three nodes:

a. Change yarn.scheduler.maximum-allocation-mb to 7000 to allow more maximum

memory per container

b. Add the following property to yarn-site.xml. This property increases the yarn container

off-heap memory usage.

c. In your spylon kernel, set launcher.executor_memory=6000m

What you need to turn in: Your Spylon Notebook (with .ipynb extension). You must add a markdown before every code segment,

explaining what the code segment does and to interpret the results (similar to the labs in this

module). You will not receive a full credit if you don’t have this documentation.

Grading Rubric: Section Points

Feature Engineering: Correctly Compute the rating column

3 pts

Feature Engineering: Correctly Display the distribution of the rating column

3 pts

Correctly downsample the dataset to make the rating column balanced

3 pts

Extracting features from review text( removing stop words and punctuations, stemming, and TFIDF vectors)

5 pts

Creating correct ML pipelines for logistic regression, GBTclassification trees, and random forest.

10 pts

Building the ensemble model and evaluating it 3 pts

Adding Documentation and markdown before every code segment

3 pts

Total 30 pts