Rapid miner


Module 1 Workshops and Assignment Copyright © Jacob L. Cybulski

Assignment A2 Preview: RM Cluster+NN+Text

Australian Wine Importers (AWI) has asked you to develop a method of estimating the rating (points) of imported wines based on their text and structured attributes.

AWI provided you with a sample of 130,000 wine tasting results, which include:

  • Wine “title” (name + vintage)
  • Country, Province and Region
  • Variety and Winery
  • Description and Designation
  • Price (US$)

However, Taster name and Points are to be excluded.

In the future, AWI would like to gain preliminary insight into wine quality based on social media reviews. The following questions are of interest to AWI:

A. Which group of wines is the new wine most similar to, and why / how?

B. What is the estimated rating of a wine newly introduced to the Australian market? (fractional ratings permitted)

AWI wants you to clean up and explore the wine tasting data, develop and evaluate a wine rating estimator, and minimise the estimation error in the process.


The following mini-case study will be used in assignment A2.

Data: www.deakin.edu.au/~jlcybuls/pred/data/Wine-Reviews.zip

Source: https://www.kaggle.com/zynicide/wine-reviews

Part LP4

Exec: Create a problem definition and write a brief spec of its possible solution.

Model: Create at least these two models: (M1) decision trees and (M2) neural nets. Ensure your solution considers three types of models, based on (A1) structured data only, (A2) text data only, and (A3) a mix of structured and text data. Describe the operators' properties. Optionally, create model ensembles. Optionally, utilise clusters and deal with anomalies, using PCA in their visualisation. Optionally, answer question (B).
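The three input configurations (A1), (A2) and (A3) can be illustrated outside RapidMiner with a minimal Python sketch. It is not part of the assignment workflow. The column names, the split between "structured" and "text" attributes, and the treatment of points as a label only (never as an input) are all assumptions made for illustration and should be checked against the actual data file.

```python
# Assumed column names; verify against the header of the real CSV file.
STRUCTURED_COLUMNS = ["country", "province", "region", "variety", "winery", "price"]
TEXT_COLUMNS = ["title", "description", "designation"]
EXCLUDED_COLUMNS = ["taster_name"]   # never used as model input
LABEL_COLUMN = "points"              # assumed: kept only as the value to estimate


def feature_sets(record):
    """Split one wine record into the A1, A2 and A3 input configurations."""
    structured = {name: record.get(name) for name in STRUCTURED_COLUMNS}
    # Join the non-empty text attributes into a single text field.
    text = " ".join(str(record[name]) for name in TEXT_COLUMNS if record.get(name))
    return {
        "A1": structured,                    # structured data only
        "A2": {"text": text},                # text data only
        "A3": {**structured, "text": text},  # mix of structured and text data
    }


# Made-up record, for illustration only.
sample = {
    "title": "Example Estate 2015 Reserve Shiraz",
    "country": "Australia",
    "province": "South Australia",
    "region": "Barossa Valley",
    "variety": "Shiraz",
    "winery": "Example Estate",
    "description": "Dark plum and pepper with firm tannins.",
    "designation": "Reserve",
    "price": 35.0,
    "taster_name": "Someone",
    "points": 91,
}
sets = feature_sets(sample)
```

Keeping the three configurations side by side makes it easier to train M1 and M2 on each of them and compare the results on equal terms.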

Validate & Optimise: Optimise the models' performance to minimise the overall error in ratings. Compare the performance of all models (including ensembles) using R², correlation and other measures. Visualise the optimisation results. Optionally, use grid optimisation.
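To make the comparison measures concrete, the following minimal Python sketch computes R², Pearson correlation, RMSE and MAE for a set of estimated ratings. It is an illustration only, not the RapidMiner performance operator; the function name `regression_metrics` and the sample ratings are made up for this example.

```python
import math


def regression_metrics(actual, predicted):
    """Return RMSE, MAE, R^2 and Pearson correlation for two equal-length lists.

    Assumes both lists vary (non-zero variance); otherwise R^2 and the
    correlation are undefined and a ZeroDivisionError is raised.
    """
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_a) ** 2 for a in actual)                # total sum of squares
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
        "r2": 1 - ss_res / ss_tot,
        "correlation": cov / math.sqrt(ss_tot * var_p),
    }


# Made-up ratings, for illustration only.
actual = [88.0, 90.0, 92.0, 86.0]
predicted = [87.0, 91.0, 91.0, 87.0]
metrics = regression_metrics(actual, predicted)
```

Note that R² and correlation can disagree: a model whose estimates are consistently offset from the true ratings can still correlate strongly with them, so it is worth reporting an error measure such as RMSE alongside both.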

Solution: Create a quality deployment process. Score the best model, and demonstrate how to apply the model to new data.

Extend: Conduct research and use novel data mining approaches.

Tasks and Deliverables

Part LP3

Exec: Briefly define a problem in business terms.

Rels: Perform cluster analysis of wines’ text. Conduct segmentation analysis, including both text and structured data. Identify relationships in data. Visualise and interpret results. Answer the management question (A).
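One way to approach question (A) is to compare the new wine's description with the average text profile of each group. The sketch below is a minimal Python illustration using TF-IDF weights and cosine similarity. It is not the RapidMiner text processing or clustering operators; the group names and descriptions are invented, and the groups are assumed to have been produced already by some clustering step.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency for every term in docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector; terms unseen when building idf are dropped."""
    tf = Counter(tokenize(text))
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def centroid(vectors):
    """Average of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar_group(new_text, groups):
    """Return the best-matching group name and the similarity to every group."""
    idf = build_idf([doc for docs in groups.values() for doc in docs])
    new_vec = vectorize(new_text, idf)
    scores = {
        name: cosine(new_vec, centroid([vectorize(doc, idf) for doc in docs]))
        for name, docs in groups.items()
    }
    return max(scores, key=scores.get), scores


# Invented groups and descriptions, for illustration only.
groups = {
    "fruity whites": [
        "crisp citrus and green apple with a fresh finish",
        "bright lemon and peach with a crisp finish",
    ],
    "oaky reds": [
        "dark cherry and oak with firm tannins",
        "blackberry, vanilla oak and smoky tannins",
    ],
}
best, scores = most_similar_group("fresh lemon and apple with a crisp citrus finish", groups)
```

The terms with the largest shared weights between the new wine and the winning group's centroid give a starting point for the "why / how" part of the question.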
