Data mining lab assignment
Lab Assignment 3
CIS660/EEC 525
Sunnie Chung
Designing and Building a Prediction Model for Bike Buyer Data with a Classifier
Choose any two classifiers covered in class and apply them to your Bike Buyer data set. Plan your experiment as follows:
1. Determine the data preprocessing methods required for each of your classifiers.
2. For each classifier, do one of the following:
2-1. Compare the accuracy of the classifier with two different sets of input parameters, if applicable, or
2-2. Compare the accuracy of the classifier with two different data preprocessing methods.
Optional for Extra Credit:
2-3. Experiment with feature selection using PCA tools or your own experiment (see below for an example).
3. Compare the accuracy of each test of the classifiers
4. Discuss your results:
Why the induced model differs for the same training data as you change the parameter values or the classifier.
Why a certain parameter setting or classifier shows better accuracy than the others you tried.
Anything else you observed.
Data:
Use the VTargetMail data that you selected and preprocessed in Lab 1 as the training and test sets for Lab 3.
EEC 525 students may choose their data source from any sensor data set or machine-generated signals.
Phases:
1. Determine the data preprocessing methods to apply for each of your classifiers.
For example:
Discretization for Decision Tree
Vectorization of a record for SVM
Normalization for Neural Network
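Two of the preprocessing steps named above, normalization and vectorization of a record, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the records, attribute names, and helper functions are hypothetical and are not taken from VTargetMail or from any particular tool.

```python
# Hypothetical records; attribute names and values are illustrative only.
records = [
    {"age": 30, "yearly_income": 40000.0, "gender": "M"},
    {"age": 50, "yearly_income": 90000.0, "gender": "F"},
    {"age": 40, "yearly_income": 60000.0, "gender": "F"},
]


def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categorical values as 0/1 indicator lists (categories sorted)."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def vectorize(records, numeric_keys, categorical_keys):
    """Turn each record into one flat numeric vector."""
    columns = [min_max_normalize([r[k] for r in records]) for k in numeric_keys]
    vectors = [list(row) for row in zip(*columns)]
    for key in categorical_keys:
        encoded = one_hot([r[key] for r in records])
        for vec, bits in zip(vectors, encoded):
            vec.extend(bits)
    return vectors


vectors = vectorize(records, ["age", "yearly_income"], ["gender"])
for vec in vectors:
    print(vec)
```

Each output row holds the normalized numeric attributes followed by the indicator columns, which is the kind of input that distance-based and gradient-based classifiers such as SVM and neural networks generally expect.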
2. Design your data analytic experiment with two different classifiers of your choice. Choose any two different classifiers covered in class (for example, Decision Tree, Naïve Bayes, SVM, Neural Network, K Nearest Neighbor, or any other classifier) and compare the accuracy of their results.
2-1. Experiment to find the best parameter setting for your classifier.
For Example:
Decision Tree Classifier: C5 for GainRatioSplit, CART for GiniSplit, on the same set of data with different parameter settings as follows:
Measure: Entropy, GINI
Different Minimum Support Thresholds
Different Complexity Penalty Degrees on the Number of Splits
Neural Network:
Test with two different topologies: the number of units in a hidden layer, the number of hidden layers
SVM: Test with different Kernel functions
K Nearest Neighbor: Test with two different K values and distance metrics
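As an illustration of the kind of parameter comparison described in 2-1, the sketch below runs a small K Nearest Neighbor classifier with two K values and two distance metrics and records the accuracy of each combination. The toy data set and all function names are hypothetical; in the lab you would substitute your own preprocessed training and test sets and, most likely, a library implementation.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k, distance):
    """Majority vote among the k training records closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_X, train_y, test_X, test_y, k, distance):
    """Fraction of test records whose predicted label matches the true label."""
    hits = sum(
        knn_predict(train_X, train_y, x, k, distance) == y
        for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)


# Hypothetical, already normalized toy data (1 = buyer, 0 = non-buyer).
# The record at [0.45, 0.45] acts as a noisy point inside the class-0 region.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.45, 0.45],
           [0.7, 0.7], [0.8, 0.9], [0.9, 0.8]]
train_y = [0, 0, 0, 1, 1, 1, 1]
test_X = [[0.15, 0.15], [0.85, 0.85], [0.4, 0.4], [0.6, 0.6]]
test_y = [0, 1, 0, 1]

results = {}
for k in (1, 3):
    for name, metric in (("euclidean", euclidean), ("manhattan", manhattan)):
        results[(k, name)] = knn_accuracy(train_X, train_y, test_X, test_y, k, metric)
        print(f"k={k}, metric={name}: accuracy={results[(k, name)]:.2f}")
```

On this toy data the K=1 runs are misled by the single noisy training record near [0.4, 0.4], while K=3 outvotes it. Explaining this kind of effect is what the discussion phase asks for.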
Or alternatively
2-2. For Naïve Bayes, NN or SVM, experiment with two different data transformation methods.
For continuous and numeric attributes:
1) Data set as floating point without discretization and binarization
2) Data set with discretization and binarization
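The two input variants in 2-2 can be produced from the same numeric attribute as sketched below. Equal-width binning is assumed here purely for illustration (other discretization schemes are equally valid), and the income values and function names are hypothetical.

```python
def discretize(values, n_bins):
    """Equal-width discretization: map each value to a bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: everything falls in one bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def binarize(bin_indices, n_bins):
    """Turn each bin index into n_bins 0/1 indicator attributes."""
    return [[1 if b == i else 0 for i in range(n_bins)] for b in bin_indices]


# Hypothetical values of one continuous attribute.
yearly_income = [10000.0, 30000.0, 60000.0, 90000.0, 170000.0]

# Variant 1: keep the attribute as floating point.
as_float = [[v] for v in yearly_income]

# Variant 2: discretize into 4 bins, then binarize the bins.
bins = discretize(yearly_income, 4)
as_binary = binarize(bins, 4)

for raw, b, bits in zip(yearly_income, bins, as_binary):
    print(raw, "-> bin", b, "->", bits)
```

Training the same classifier once on `as_float` and once on `as_binary` (together with the other attributes) gives the two accuracy figures to compare.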
For Extra Credit
2-3. Experiment with feature selection using either
1) Feature significance analysis with PCA tools, or
2) Your own experiment as follows:
Simple Experiment for Feature Selection Methodology
2-3-1. Pick the best parameter setting and data transformation from Phase 2.
2-3-2. Apply your classifier with the best parameter set to each different feature set from your input file to see whether there is any significant difference in the result for each iteration. (See below for an example.)
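The following is not the lecture example referred to above. It is a minimal sketch of the feature-subset loop itself, assuming a 1-nearest-neighbor classifier as a stand-in for whichever model and parameter setting you found best. The feature names, toy data, and function names are all hypothetical.

```python
from itertools import combinations

FEATURES = ["feature_a", "feature_b", "feature_c"]  # hypothetical names

# Hypothetical (feature vector, class label) pairs, already normalized.
train = [
    ([0.1, 0.9, 0.2], 0),
    ([0.2, 0.1, 0.1], 0),
    ([0.8, 0.8, 0.9], 1),
    ([0.9, 0.2, 0.8], 1),
]
test = [
    ([0.15, 0.2, 0.15], 0),
    ([0.85, 0.9, 0.85], 1),
    ([0.3, 0.88, 0.3], 0),
    ([0.7, 0.18, 0.7], 1),
]


def one_nn_accuracy(train, test, columns):
    """Accuracy of a 1-nearest-neighbor classifier restricted to `columns`."""
    def project(row):
        return [row[i] for i in columns]

    correct = 0
    for row, label in test:
        query = project(row)
        distances = [
            (sum((a - b) ** 2 for a, b in zip(project(r), query)), y)
            for r, y in train
        ]
        predicted = min(distances, key=lambda d: d[0])[1]
        correct += int(predicted == label)
    return correct / len(test)


# Evaluate every non-empty combination of features.
results = {}
for size in range(1, len(FEATURES) + 1):
    for columns in combinations(range(len(FEATURES)), size):
        names = tuple(FEATURES[i] for i in columns)
        results[names] = one_nn_accuracy(train, test, columns)

for names, acc in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{acc:.2f}  {', '.join(names)}")
```

Exhaustive enumeration is only practical for a handful of features, since the number of subsets doubles with each added feature. With a wider input file you would test a few hand-picked feature sets instead.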
3. Validate your results against your test set to compare the accuracy of your models for each classifier under different parameter settings or different transformation methods.
4. Discuss your results:
Why the induced model differs for the same training data as you change the parameter values or the classifier.
Why a classifier shows better accuracy than the others for a certain parameter setting or with a different transformation method.
Any other observations you made.
5. Extra credit goes to anyone whose accuracy is among the top 5 in the class.
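Accuracy, the measure used to compare models throughout the phases above, is the fraction of test records whose predicted class matches the actual class. A minimal sketch of computing it and ranking several runs on the same test set is shown below. The run names, labels, and predictions are hypothetical placeholders, not real results.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs, i.e. the confusion matrix cells."""
    return Counter(zip(actual, predicted))


# Hypothetical test-set labels (1 = buyer, 0 = non-buyer).
test_labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical predictions from two runs of the same classifier.
predictions = {
    "run_setting_1": [1, 0, 0, 1, 0, 1, 1, 0],
    "run_setting_2": [1, 0, 1, 1, 0, 1, 1, 0],
}

scores = {name: accuracy(test_labels, pred) for name, pred in predictions.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: accuracy={score:.3f}")
    print("  (actual, predicted) counts:", dict(confusion_counts(test_labels, predictions[name])))
```

The confusion counts show where each run goes wrong (for example, non-buyers predicted as buyers), which a single accuracy number hides.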
Available Platforms:
You can use any data analytic systems/tools of your choice.
Some suggested platforms:
Python Data Science Platforms:
(See the beginning of Lab Section for a Full list of Data Science Platforms)
• Anaconda
https://www.anaconda.com/open-source
(See Fundamental Section for List of Data Science Platforms)
Anaconda Tutorials:
https://docs.anaconda.com/anaconda/navigator/tutorials/
• Any available Classifiers as Open Source:
For example, C5 or CART for Decision Tree
Download C5 and CART at:
http://www.rulequest.com/see5-info.html
http://www.salford-systems.com/downloadspm
Other Java-based data mining platform site:
http://www.cs.waikato.ac.nz/~ml/weka/
Extra Credit: Experiment with feature selection using either
Feature Significance Analysis with PCA tools (see the PCA example in the Lecture Note section for this), or
Your own experiment as follows:
1. Simple Experiment for Feature Selection Methodology to choose the best feature set:
1-1. Pick the best model with the best parameter setting from Phases 1 and 2.
1-2. Apply your model with the best parameters to different input sets (created with different combinations of feature sets from your VTargetMail input file) to see whether there are any significant differences in accuracy among the feature sets. See Lecture Note Slide 75 on DW or Slide 34 in J Han’s Chap 3 Data Exploration as below:
Submission:
1. Screen captures of your installation/setup procedure, with the related source info documented (which software, link to the site, which classifier algorithm, etc.).
2. Document your experiments with all the steps for each classifier.
3. Document your models, if applicable, with each of the different parameter settings or different transformation methods and the resulting accuracy.
4. Report your discussion, observations, and findings on your results.
5. Grade will be based on completion of the required tasks and the accuracy (performance) of your classifiers.
6. Put a note on the cover page of your Lab 3 report if you did the Extra Credit experiment.