Computer Science Homework 2


Homework 2.

Question 1. Decision Tree Classifier [10 Points]

Data: The zip file “hw2.q1.data.zip” contains 3 CSV files:

· “hw2.q1.train.csv” contains 10,000 rows and 26 columns. The first column ‘y’ is the output variable with 2 classes: 0, 1. The remaining 25 columns contain input features: x_1, …, x_25.

· “hw2.q1.test.csv” contains 2,000 rows and 26 columns. The first column ‘y’ is the output variable with 2 classes: 0, 1. The remaining 25 columns contain input features: x_1, …, x_25.

· “hw2.q1.new.csv” contains 30 rows and 26 columns. The first column ‘ID’ is an identifier for 30 unlabeled samples. The remaining 25 columns contain input features: x_1, …, x_25.

Task 1. [4 points]

Use 5-fold cross-validation with the 10,000 labeled examples from “hw2.q1.train.csv” to determine the fewest rules with which a decision tree classifier can achieve a mean cross-validation accuracy of at least 0.96. Report the number of rules needed, the cross-validation accuracy obtained, and all the hyper-parameter values for the DecisionTreeClassifier.

Fewest number of rules needed: ………………. (to achieve mean cross-validation accuracy of at least 0.96)

Mean cross-validation accuracy: ………………………. (rounded to 4 decimal places)

Non-default hyper-parameter values for selected DecisionTreeClassifier model:
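One possible way to approach Task 1 is sketched below. It assumes that "number of rules" means the number of leaf nodes (each root-to-leaf path is one rule) and that the rule count is capped through the max_leaf_nodes hyper-parameter; other readings are possible. The function name fewest_rules and the synthetic stand-in data are illustrative assumptions, not part of the assignment. With the real file, X and y would instead come from “hw2.q1.train.csv”.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def fewest_rules(X, y, target=0.96, max_rules=200, cv=5, seed=0):
    """Return (n_rules, mean_cv_accuracy) for the smallest max_leaf_nodes
    whose mean cross-validation accuracy reaches `target`, or None."""
    for n_rules in range(2, max_rules + 1):
        # max_leaf_nodes is an upper bound; the fitted tree may use fewer
        # leaves, which can be checked with tree.get_n_leaves() after fit().
        tree = DecisionTreeClassifier(max_leaf_nodes=n_rules, random_state=seed)
        mean_acc = cross_val_score(tree, X, y, cv=cv, scoring="accuracy").mean()
        if mean_acc >= target:
            return n_rules, round(float(mean_acc), 4)
    return None


# Synthetic stand-in data (assumption): y is 1 only when the first two
# features are both positive, so three leaves are enough to separate it.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = ((X_demo[:, 0] > 0) & (X_demo[:, 1] > 0)).astype(int)

result = fewest_rules(X_demo, y_demo)
print(result)
```

The loop stops at the first leaf budget that meets the target, so the value it returns is the smallest one tried. Other hyper-parameters (for example criterion or min_samples_leaf) are left at their defaults here and could change the answer.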

Task 2. [2 Points]

Train a DecisionTreeClassifier with the hyper-parameter values determined in Task 1 on all 10,000 training samples and use it to predict the output class ‘y’ for the 2,000 examples in “hw2.q1.test.csv”. Report the following:

· Accuracy on 2,000 test examples: …………………… (rounded to 4 decimal places)

· Classification report for the 2,000 test examples:

· Confusion matrix for the 2,000 test examples:
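The three items requested above can be produced with functions from sklearn.metrics. The following is a minimal sketch: the helper name fit_and_evaluate and the synthetic arrays are assumptions for illustration, and max_leaf_nodes=3 is only a placeholder for whatever Task 1 yields.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier


def fit_and_evaluate(model, X_train, y_train, X_test, y_test):
    """Fit on the full training set and summarise test-set performance."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": round(float(accuracy_score(y_test, y_pred)), 4),
        "report": classification_report(y_test, y_pred, digits=4),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }


# Synthetic stand-in for the train/test CSV files (assumption).
rng = np.random.default_rng(0)
X_all = rng.normal(size=(1200, 5))
y_all = ((X_all[:, 0] > 0) & (X_all[:, 1] > 0)).astype(int)

results = fit_and_evaluate(
    DecisionTreeClassifier(max_leaf_nodes=3, random_state=0),  # placeholder value
    X_all[:1000], y_all[:1000],
    X_all[1000:], y_all[1000:],
)
print("Accuracy:", results["accuracy"])
print(results["report"])
print(results["confusion_matrix"])
```

Because the helper accepts any scikit-learn classifier, the same sketch would also apply to the 4-class model in Question 2.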

Task 3. [2 Points]

Use the model trained in Task 2 to predict the output class ‘y’ for the 30 examples in “hw2.q1.new.csv”. Specify the predicted classes in the table below:

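ID    predicted y
1     ………
2     ………
3     ………
4     ………
5     ………
6     ………
7     ………
8     ………
9     ………
10    ………
11    ………
12    ………
13    ………
14    ………
15    ………
16    ………
17    ………
18    ………
19    ………
20    ………
21    ………
22    ………
23    ………
24    ………
25    ………
26    ………
27    ………
28    ………
29    ………
30    ………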
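A table of this shape can be generated directly from the model's predictions. The sketch below assumes pandas is available and that the unlabeled file has an ‘ID’ column followed by the feature columns, as described above. The helper name predict_table and the small synthetic frames are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def predict_table(model, new_df, id_col="ID"):
    """Return a two-column table: identifier and predicted class."""
    features = new_df.drop(columns=[id_col])  # keep only the input features
    return pd.DataFrame({
        id_col: new_df[id_col].to_numpy(),
        "predicted y": model.predict(features),
    })


# Synthetic stand-in data (assumption); the real frames would come from
# pd.read_csv on the training file and on the 30-row unlabeled file.
rng = np.random.default_rng(0)
cols = ["x_1", "x_2", "x_3"]
train_df = pd.DataFrame(rng.normal(size=(500, 3)), columns=cols)
train_y = (train_df["x_1"] > 0).astype(int)
model = DecisionTreeClassifier(max_leaf_nodes=2, random_state=0).fit(train_df, train_y)

new_df = pd.DataFrame(rng.normal(size=(30, 3)), columns=cols)
new_df.insert(0, "ID", range(1, 31))

table = predict_table(model, new_df)
print(table.to_string(index=False))
```

Dropping the identifier column before calling predict matters: if ‘ID’ were passed as a feature, the column count would not match what the model was trained on.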

Task 4. [2 Points]

Of the 25 input variables, which ones are relevant for this classification task?

The following … input variables are relevant for this classification task: …………………

Display your trained decision tree:
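One way to answer Task 4 is to read the relevant variables off the fitted tree itself: a feature the tree never splits on has an importance of exactly zero. The sketch below takes that approach and prints the tree as text rules with export_text. The helper relevant_features and the synthetic data are illustrative assumptions; treating "relevant" as "used in at least one split" is also an assumption about what the question intends.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text


def relevant_features(tree, feature_names):
    """Names of features the fitted tree actually splits on."""
    return [
        name
        for name, importance in zip(feature_names, tree.feature_importances_)
        if importance > 0
    ]


# Synthetic stand-in data (assumption): only x_1 and x_2 determine y.
rng = np.random.default_rng(0)
names = [f"x_{i}" for i in range(1, 6)]
X_demo = rng.normal(size=(1000, 5))
y_demo = ((X_demo[:, 0] > 0) & (X_demo[:, 1] > 0)).astype(int)

tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X_demo, y_demo)
used = relevant_features(tree, names)
print("Relevant features:", used)
print(export_text(tree, feature_names=names))
```

For a graphical display, sklearn.tree.plot_tree draws the same tree with matplotlib. The text form is used here only because it needs no plotting backend.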

Question 2. Supervised machine learning classifiers [10 Points]

Data: The zip file “hw2.q2.data.zip” contains 3 CSV files:

· “hw2.q2.train.csv” contains 8,000 rows and 11 columns. The first column ‘y’ is the output variable with 4 classes: 0, 1, 2, 3. The remaining 10 columns contain input features: x1, …, x10.

· “hw2.q2.test.csv” contains 2,000 rows and 11 columns. The first column ‘y’ is the output variable with 4 classes: 0, 1, 2, 3. The remaining 10 columns contain input features: x1, …, x10.

· “hw2.q2.new.csv” contains 30 rows and 11 columns. The first column ‘ID’ is an identifier for 30 unlabeled samples. The remaining 10 columns contain input features: x1, …, x10.

Task 1. [6 points]

Use 4-fold cross-validation with the 8,000 labeled examples from “hw2.q2.train.csv” to identify a classifier that achieves a mean cross-validation accuracy of at least 0.96. You should try several Scikit-Learn classifiers, including: GaussianNB, DecisionTreeClassifier, RandomForestClassifier, ExtraTreesClassifier, KNeighborsClassifier, LogisticRegression, SVC, and MLPClassifier. Try different hyper-parameter values for the better-performing classifiers to obtain a good set of hyper-parameter values. Then select the best-performing model. Report the following:

Selected model with hyper-parameter values:

Mean cross-validation accuracy: ………………………. (rounded to 4 decimal places)
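The comparison and tuning steps described in Task 1 could be organised roughly as follows. This is a sketch under several assumptions: the helper names (candidate_models, compare_models, tune) are hypothetical, the scale-sensitive classifiers are wrapped in a StandardScaler pipeline as a design choice the assignment does not require, the parameter grid is a small illustrative one, and the data is a synthetic 4-class stand-in generated with make_classification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def candidate_models(seed=0):
    """The eight classifiers named in the task, mostly with default settings."""
    def scaled(estimator):
        # Standardise features inside each CV fold for scale-sensitive models.
        return make_pipeline(StandardScaler(), estimator)

    return {
        "GaussianNB": GaussianNB(),
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=seed),
        "RandomForestClassifier": RandomForestClassifier(random_state=seed),
        "ExtraTreesClassifier": ExtraTreesClassifier(random_state=seed),
        "KNeighborsClassifier": scaled(KNeighborsClassifier()),
        "LogisticRegression": scaled(LogisticRegression(max_iter=1000)),
        "SVC": scaled(SVC()),
        "MLPClassifier": scaled(MLPClassifier(max_iter=1000, random_state=seed)),
    }


def compare_models(X, y, cv=4):
    """Return (name, mean CV accuracy) pairs, best first."""
    scores = {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
        for name, model in candidate_models().items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def tune(model, param_grid, X, y, cv=4):
    """Grid-search one model; return its best parameters and CV accuracy."""
    search = GridSearchCV(model, param_grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    return search.best_params_, round(float(search.best_score_), 4)


# Synthetic 4-class stand-in data (assumption), not the homework data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=4, random_state=0
)

ranking = compare_models(X_demo, y_demo)
for name, score in ranking:
    print(f"{name:24s} {score:.4f}")

# Illustrative grid only; a real search would target the top-ranked models.
best_params, best_score = tune(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [None, 10]},
    X_demo, y_demo,
)
print(best_params, best_score)
```

Putting the scaler inside the pipeline means it is refitted on the training folds only, so the cross-validation score is not inflated by information from the held-out fold.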

Task 2. [2 Points]

Train the classifier with the hyper-parameter values determined in Task 1 on all 8,000 training samples and use it to predict the output class ‘y’ for the 2,000 examples in “hw2.q2.test.csv”. Report the following:

· Accuracy on 2,000 test examples: …………………… (rounded to 4 decimal places)

· Classification report for the 2,000 test examples:

· Confusion matrix for the 2,000 test examples:

Task 3. [2 Points]

Use the model trained in Task 2 to predict the output class ‘y’ for the 30 examples in “hw2.q2.new.csv”. Specify the predicted classes in the table below:

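ID        predicted y
ID_001    ………
ID_002    ………
ID_003    ………
ID_004    ………
ID_005    ………
ID_006    ………
ID_007    ………
ID_008    ………
ID_009    ………
ID_010    ………
ID_011    ………
ID_012    ………
ID_013    ………
ID_014    ………
ID_015    ………
ID_016    ………
ID_017    ………
ID_018    ………
ID_019    ………
ID_020    ………
ID_021    ………
ID_022    ………
ID_023    ………
ID_024    ………
ID_025    ………
ID_026    ………
ID_027    ………
ID_028    ………
ID_029    ………
ID_030    ………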