Computer Science Homework 2
Homework 2.
Question 1. Decision Tree Classifier [10 Points]
Data: The zip file “hw2.q1.data.zip” contains 3 CSV files:
· “hw2.q1.train.csv” contains 10,000 rows and 26 columns. The first column ‘y’ is the output variable with 2 classes: 0, 1. The remaining 25 columns contain input features: x_1, …, x_25.
· “hw2.q1.test.csv” contains 2,000 rows and 26 columns. The first column ‘y’ is the output variable with 2 classes: 0, 1. The remaining 25 columns contain input features: x_1, …, x_25.
· “hw2.q1.new.csv” contains 30 rows and 26 columns. The first column ‘ID’ is an identifier for 30 unlabeled samples. The remaining 25 columns contain input features: x_1, …, x_25.
Task 1. [4 points]
Use 5-fold cross-validation with the 10,000 labeled examples from “hw2.q1.train.csv” to determine the fewest rules with which a decision tree classifier can achieve a mean cross-validation accuracy of at least 0.96. Report the number of rules needed, the cross-validation accuracy obtained, and all the hyper-parameter values for the DecisionTreeClassifier.
Fewest number of rules needed: ………………. (to achieve mean cross-validation accuracy of at least 0.96)
Mean cross-validation accuracy: ………………………. (rounded to 4 decimal places)
Non-default hyper-parameter values for selected DecisionTreeClassifier model:
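One possible way to approach the search in Task 1 is sketched below. It assumes that each leaf of the tree corresponds to one rule, so the number of rules can be capped with `max_leaf_nodes`; the handout does not state this, and other hyper-parameters could be tuned as well. The helper name `fewest_leaves_for_target` and the synthetic data are illustrative stand-ins, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def fewest_leaves_for_target(X, y, target=0.96, max_leaves=50, cv=5, random_state=0):
    """Return (leaf cap, mean CV accuracy) for the smallest cap reaching target, else None.

    Assumption: one leaf is counted as one rule (hypothetical helper).
    """
    for n_leaves in range(2, max_leaves + 1):
        tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=random_state)
        scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
        if scores.mean() >= target:
            return n_leaves, round(float(scores.mean()), 4)
    return None


# Synthetic stand-in data; the real features would come from the training CSV.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=25, n_informative=5, random_state=0
)
result = fewest_leaves_for_target(X_demo, y_demo, target=0.80, max_leaves=30)
print(result)
```

Note that `max_leaf_nodes` is an upper bound, so after fitting the final model it may be worth confirming the actual leaf count with `get_n_leaves()`.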
Task 2. [2 Points]
Train a DecisionTreeClassifier with the hyper-parameter values determined in Task 1 on all 10,000 training samples and use it to predict the output class ‘y’ for the 2,000 examples in “hw2.q1.test.csv”. Report the following:
· Accuracy on 2,000 test examples: …………………… (rounded to 4 decimal places)
· Classification report for the 2,000 test examples:
· Confusion matrix for the 2,000 test examples:
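The three items above can be produced with functions from `sklearn.metrics`. The following is a minimal sketch on synthetic stand-in data; the split sizes and the `max_leaf_nodes=8` setting are placeholders, not values taken from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train and test CSV files.
X, y = make_classification(n_samples=1200, n_features=25, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, random_state=0, stratify=y
)

# Placeholder hyper-parameters; substitute the values selected in Task 1.
model = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

test_accuracy = round(float(accuracy_score(y_test, y_pred)), 4)
report = classification_report(y_test, y_pred, digits=4)
matrix = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print("Accuracy:", test_accuracy)
print(report)
print(matrix)
```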
Task 3. [2 Points]
Use the model trained in Task 2 to predict the output class ‘y’ for the 30 examples in “hw2.q1.new.csv”. Specify the predicted classes in the table below:
| ID | predicted y |
|----|-------------|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
| 15 | |
| 16 | |
| 17 | |
| 18 | |
| 19 | |
| 20 | |
| 21 | |
| 22 | |
| 23 | |
| 24 | |
| 25 | |
| 26 | |
| 27 | |
| 28 | |
| 29 | |
| 30 | |
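A table of this shape can be assembled by pairing the ‘ID’ column with the model's predictions. The sketch below uses an in-memory CSV and synthetic data in place of the real files, and it assumes the CSV headers are literally `ID`, `y`, and `x_1` through `x_25`, which the handout does not confirm.

```python
import io

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
feature_names = [f"x_{i}" for i in range(1, 26)]  # assumed header names

# Synthetic labeled data standing in for the training file.
train = pd.DataFrame(rng.normal(size=(200, 25)), columns=feature_names)
train.insert(0, "y", (train["x_1"] > 0).astype(int))

# Synthetic unlabeled data, round-tripped through an in-memory CSV.
new = pd.DataFrame(rng.normal(size=(30, 25)), columns=feature_names)
new.insert(0, "ID", np.arange(1, 31))
buffer = io.StringIO()
new.to_csv(buffer, index=False)
buffer.seek(0)
new_df = pd.read_csv(buffer)

# Placeholder hyper-parameters; substitute the values selected in Task 1.
model = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0)
model.fit(train[feature_names], train["y"])

predictions = pd.DataFrame(
    {"ID": new_df["ID"], "predicted y": model.predict(new_df[feature_names])}
)
print(predictions.to_string(index=False))
```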
Task 4. [2 Points]
Of the 25 input variables, which ones are relevant for this classification task?
The following … input variables are relevant for this classification task: …………………
Display your trained decision tree:
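One way to approach Task 4 is to treat the features that the fitted tree actually splits on, that is, those with non-zero `feature_importances_`, as the relevant ones, and to print the tree with `export_text`. This is a sketch under that assumption, using synthetic data in which only two hypothetical features determine the label; `sklearn.tree.plot_tree` would be an alternative for a graphical display.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
feature_names = [f"x_{i}" for i in range(1, 26)]  # assumed header names

# Synthetic data: the label depends only on x_3 and x_7 (illustrative choice).
X = rng.normal(size=(500, 25))
y = ((X[:, 2] > 0) & (X[:, 6] > 0)).astype(int)

model = DecisionTreeClassifier(max_leaf_nodes=6, random_state=0).fit(X, y)

# Features with zero importance are never used in a split of this tree.
relevant = [
    name
    for name, importance in zip(feature_names, model.feature_importances_)
    if importance > 0
]
tree_text = export_text(model, feature_names=feature_names)

print("Features used by the tree:", relevant)
print(tree_text)
```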
Question 2. Supervised machine learning classifiers [10 Points]
Data: The zip file “hw2.q2.data.zip” contains 3 CSV files:
· “hw2.q2.train.csv” contains 8,000 rows and 11 columns. The first column ‘y’ is the output variable with 4 classes: 0, 1, 2, 3. The remaining 10 columns contain input features: x1, …, x10.
· “hw2.q2.test.csv” contains 2,000 rows and 11 columns. The first column ‘y’ is the output variable with 4 classes: 0, 1, 2, 3. The remaining 10 columns contain input features: x1, …, x10.
· “hw2.q2.new.csv” contains 30 rows and 11 columns. The first column ‘ID’ is an identifier for 30 unlabeled samples. The remaining 10 columns contain input features: x1, …, x10.
Task 1. [6 points]
Use 4-fold cross-validation with the 8,000 labeled examples from “hw2.q2.train.csv” to identify a classifier that achieves a mean cross-validation accuracy of at least 0.96. You should try several Scikit-Learn classifiers, including: GaussianNB, DecisionTreeClassifier, RandomForestClassifier, ExtraTreesClassifier, KNeighborsClassifier, LogisticRegression, SVC, and MLPClassifier. Try different hyper-parameter values for the better-performing classifiers to obtain a good set of hyper-parameter values. Then select the best-performing model. Report the following:
Selected model with hyper-parameter values:
Mean cross-validation accuracy: ………………………. (rounded to 4 decimal places)
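A first pass over the listed classifiers could look like the sketch below, which scores each one with 4-fold cross-validation on synthetic 4-class stand-in data. Wrapping the scale-sensitive models in a `StandardScaler` pipeline is a design choice made here, not something the handout requires, and all hyper-parameter values shown are defaults or placeholders to be tuned afterwards.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 4-class training file.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

candidates = {
    "GaussianNB": GaussianNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=100, random_state=0),
    # Distance- and gradient-based models are scaled first (a choice made here).
    "KNeighborsClassifier": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "MLPClassifier": make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0)),
}

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=4, scoring="accuracy")
    cv_scores[name] = round(float(scores.mean()), 4)

best_name = max(cv_scores, key=cv_scores.get)
for name, score in sorted(cv_scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score}")
print("Best so far:", best_name)
```

The better-performing candidates could then be passed to `GridSearchCV` with `cv=4` to explore their hyper-parameters more systematically.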
Task 2. [2 Points]
Train the classifier with the hyper-parameter values determined in Task 1 on all 8,000 training samples and use it to predict the output class ‘y’ for the 2,000 examples in “hw2.q2.test.csv”. Report the following:
· Accuracy on 2,000 test examples: …………………… (rounded to 4 decimal places)
· Classification report for the 2,000 test examples:
· Confusion matrix for the 2,000 test examples:
Task 3. [2 Points]
Use the model trained in Task 2 to predict the output class ‘y’ for the 30 examples in “hw2.q2.new.csv”. Specify the predicted classes in the table below:
| ID | predicted y |
|----|-------------|
| ID_001 | |
| ID_002 | |
| ID_003 | |
| ID_004 | |
| ID_005 | |
| ID_006 | |
| ID_007 | |
| ID_008 | |
| ID_009 | |
| ID_010 | |
| ID_011 | |
| ID_012 | |
| ID_013 | |
| ID_014 | |
| ID_015 | |
| ID_016 | |
| ID_017 | |
| ID_018 | |
| ID_019 | |
| ID_020 | |
| ID_021 | |
| ID_022 | |
| ID_023 | |
| ID_024 | |
| ID_025 | |
| ID_026 | |
| ID_027 | |
| ID_028 | |
| ID_029 | |
| ID_030 | |