Predicting Yelp User Rating using Spark ML library
Assignment 5---Predicting Yelp User Rating using Spark ML library (30
points)
For this assignment, you are to write a spark program to predict Yelp user rating based on their review
Text.
You can download the data files from here:
https://uofi.box.com/s/b3hwy8rax1eipmiz2r78csxp06jnodci . Unzip the folder and find review.json
and user.json . You can read the description of each file and their attributes here:
https://www.yelp.com/dataset/documentation/main
What you need to do: 1- Data Exploration:
load the review.json file and extract “text” and “stars” attributes
find the distribution of “stars” attributes; that is, find the number of reviews for
each star value.
2- Feature Engineering:
The star ratings 1,2 and 3typically indicate dissatisfaction and the star rating 4,5
shows satisfaction. Create a new column “rating” with values 0 (if the star rating
is 1,2, or 3) and 1 (if the star rating is 4,5) . This will be the target variable you
want to predict.
Find the distribution of the “rating” column; that is, find the count of reviews for
each rating=0 and rating=1. Is the rating attribute balanced? If not, you should
down sample your data. That means, keep the rating value with the lowest
count but take a sample of the reviews for the other category in order to have a
balanced distribution between both classes. This is called stratified sampling and
you can accomplish this in spark using the “sampleBy” method of dataframe.
For an example of stratified sampling you can see here:
http://allaboutscala.com/big-data/spark/#dataframe-statistics-sampleby and
here: https://mapr.com/blog/churn-prediction-pyspark-using-mllib-and-ml-
packages/ (the section on stratified sampling)
Unfortunately, the dataset is still too big for our tiny cluster and running ML
models on it can take a long time with our limited resources. So when doing
down-sampling, using sampleBy method, multiply all fractions by 0.1 to get a
sample of only 10% of reviews in each rating category after down-sampling.
Below are the counts I get after stratified sampling (Depending on the seed you
give to samplyBy method, you might get different counts but if you set the seed
to 111, you should get the same distribution as mine)
Extract TFIDF vectors from the review Text. When creating countVectorizer, use
setMinDF(100) to only include words in the feature vector that appear in at
least 100 reviews. Make sure that you remove stop words and punctuations and
use stemming as explained in the labs.
3- Building Machine Learning pipelines.
Use three different machine learning models (Logistic Regression, Random
Forest, and Gradient Boosted Classification Trees) to predict the ratings based
on the TFIDF vector of the review text. For an example of a Gradient Boosteed
Classification Trees please refer to: https://spark.apache.org/docs/2.2.0/ml-
classification-regression.html#gradient-boosted-tree-classifier
Use CrossValidation with three folds and Area Under Curve (AUC) metric to
evaluate and tune each model’s hyper-parameter ( please refer to the labs
posted for this module for examples of cross validation in spark).
Create a separate pipeline for each ML model. Split the data to testing and
training sets, fit each model on the training data, get the predictions for the test
data, and print the AUC for each model.
Explain which model did a better job on predicting the ratings in the test set?
4- Making an ensemble of the above three models:
Typically, an ensemble of multiple models works better than a single model. In
this step, you take the predictions generated by the three models above (that is,
logistic regression and random forest, and gradient boosted classification tree) ,
zip them together and compute a “prediction_ensembled” column which is
basically a majority vote of the three prediction columns generated by each
model. That is, if two or more of the models generated the same prediction,
then use that prediction in the prediction_ensembled column; otherwise, if
none of the predictions are the same, then use the rating value1 for the
prediction_ensemble column. (You can accomplish this in spark sql using a
simple case when query, see an example here:
https://stackoverflow.com/questions/25157451/spark-sql-case-when-then )
Build a dataframe with two columns “prediction_ensemble” and “rating”. Call
this dataframe “ensemble”. Use the following code segment to get the Area
under ROC curve of this ensemble model. Explain whether the Area under ROC
curve of the ensemble model is higher/lower or roughly the same as each of the
individual models?
//Importing the required libraries
import spark.implicits._
import org.apache.spark.mllib.evaluation._
//Converting the ensemble dataframe to an rdd of the form (prediction_ense
mble, rating)
val predictionsAndLabels=ensemble.selectExpr("cast(prediction_ensemble as
Double) prediction_ensemble", "cast(rating as Double) rating").rdd.map(row
=>(row.getAs[Double]("prediction_ensemble"),row.getAs[Double]("rating")))
//Compute the accuracy of the ensemble predictions
val metrics= new BinaryClassificationMetrics(predictionsAndLabels)
println(metrics.areaUnderROC)
5- Adding more Features (optional +3 pts)
Intuitively we know that some users might always give skeptical reviews while others
might always give good reviews. That suggests that adding the average star ratings given
by a user may have some correlations with the target variable, rating. The user.json file
has an attribute average_star which shows the average star ratings for each user id. You
can join this dataset with your rating dataframe to get the average_stars for the user
who gave each rating value.
Now find the correlation between average_stars and rating columns. You can use
stat.corr(col1,col2) method of the dataframe to get correlation between two numeric
columns in the dataframe. You can find an example of the stat.corr method here:
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-
dataframes-in-spark.html .
If the correlation value is significantly above zero then it might suggest a possible
positive correlation between the two columns.
Use vector assembler to make a feature vectore of average_star and the TFIDF vector of
review text. Add this assembler to each ML pipeline you created before and rerun each
pipeline. Explain if adding the average_star to the feature vector increased or decreased
each model’s performance.
Few Notes on Running your code: 1. Be patient! Each model might take a long time to run ( 30mins-1 hour) .
2. You might get a few warnings while running your ml pipeline. That would be fine. You can ignore
warnings
3. If you get an error regarding a yarn container used maximum allowed physical memory, Then
you can change the settings in your yarn-site.xml as follows on all three nodes:
a. Change yarn.scheduler.maximum-allocation-mb to 7000 to allow more maximum
memory per container
b. Add the following property to yarn-site.xml. This property increases the yarn container
off-heap memory usage.
c. In your spylon kernel, set launcher.executor_memory=6000m
What you need to turn in: Your Spylon Notebook (with .ipynb extension). You must add a markdown before every code segment,
explaining what the code segment does and to interpret the results (similar to the labs in this
module). You will not receive a full credit if you don’t have this documentation.
Grading Rubric: Section Points
Feature Engineering: Correctly Compute the rating column
3 pts
Feature Engineering: Correctly Display the distribution of the rating column
3 pts
Correctly downsample the dataset to make the rating column balanced
3 pts
Extracting features from review text( removing stop words and punctuations, stemming, and TFIDF vectors)
5 pts
Creating correct ML pipelines for logistic regression, GBTclassification trees, and random forest.
10 pts
Building the ensemble model and evaluating it 3 pts
Adding Documentation and markdown before every code segment
3 pts
Total 30 pts