Computer science homework 2

MMIS671_homework21.docx

Homework 2.

Question 1. Decision Tree Classifier [10 Points]

Data: The zip file “hw2.q1.data.zip” contains 3 CSV files:

· “hw2.q1.train.csv” contains 10,000 rows and 26 columns. The first column ‘y’ is the output variable with 2 classes: 0, 1. The remaining 25 columns contain input features: x_1, …, x_25.

· “hw2.q1.test.csv” contains 2,000 rows and 26 columns. The first column ‘y’ is the output variable with 2 classes: 0, 1. The remaining 25 columns contain input features: x_1, …, x_25.

· “hw2.q1.new.csv” contains 30 rows and 26 columns. The first column ‘ID’ is an identifier for 30 unlabeled samples. The remaining 25 columns contain input features: x_1, …, x_25.

Task 1. [4 points]

Use 5-fold cross-validation with the 10,000 labeled examples from “hw2.q1.train.csv” to determine the fewest rules with which a decision tree classifier can achieve a mean cross-validation accuracy of at least 0.96. Report the number of rules needed, the cross-validation accuracy obtained, and all the hyper-parameter values for the DecisionTreeClassifier.
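One possible way to organize this search is sketched below. It assumes that a "rule" corresponds to one leaf of the tree (one root-to-leaf path), so it increases max_leaf_nodes until the mean 5-fold accuracy reaches a threshold. The data are a synthetic stand-in generated with make_classification, the helper name fewest_leaves is hypothetical, and the threshold of 0.80 is only a placeholder for this toy data; none of these come from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training file (assumption: the real data would be
# loaded from the CSV, with column 'y' as the target and the rest as features).
X, y = make_classification(n_samples=1000, n_features=25, n_informative=4,
                           n_redundant=0, random_state=0)


def fewest_leaves(X, y, threshold, max_leaves=64, cv=5):
    """Return (max_leaf_nodes, mean CV accuracy) for the smallest leaf budget
    whose mean cross-validation accuracy reaches the threshold, or None."""
    for n_leaves in range(2, max_leaves + 1):
        model = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
        score = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        if score >= threshold:
            return n_leaves, score
    return None  # threshold not reached within the leaf budget


# Placeholder threshold for the synthetic data, not the assignment's 0.96.
result = fewest_leaves(X, y, threshold=0.80)
if result is None:
    print("Threshold not reached")
else:
    print(f"max_leaf_nodes={result[0]}, mean CV accuracy={result[1]:.4f}")
```

Because max_leaf_nodes is an upper bound, the fitted tree can end up with fewer leaves; get_n_leaves() on a fitted model reports the actual count.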

Fewest number of rules needed: ………………. (to achieve mean cross-validation accuracy of at least 0.96)

Mean cross-validation accuracy: ………………………. (rounded to 4 decimal places)

Non-default hyper-parameter values for selected DecisionTreeClassifier model:

Task 2. [2 Points]

Train a DecisionTreeClassifier with the hyper-parameter values determined in Task 1 on all 10,000 training samples and use it to predict the output class ‘y’ for the 2,000 examples in “hw2.q1.test.csv”. Report the following:

· Accuracy on 2,000 test examples: …………………… (rounded to 4 decimal places)

· Classification report for the 2,000 test examples:

· Confusion matrix for the 2,000 test examples:
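The three items above map onto functions in sklearn.metrics. The following is a minimal sketch on synthetic data; the train/test split stands in for the two CSV files, and max_leaf_nodes=8 is a placeholder rather than a value determined in Task 1.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train and test files (assumption).
X, y = make_classification(n_samples=1200, n_features=25, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, random_state=0, stratify=y)

# Placeholder hyper-parameter value, not the answer to Task 1.
model = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, digits=4)
matrix = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted

print(f"Accuracy: {accuracy:.4f}")
print(report)
print(matrix)
```

In the confusion matrix, the diagonal counts the correctly classified examples, so its sum divided by the total number of examples equals the accuracy.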

Task 3. [2 Points]

Use the model trained in Task 2 to predict the output class ‘y’ for the 30 examples in “hw2.q1.new.csv”. Specify the predicted classes in the table below:

ID	predicted y
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
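A table of this shape can be produced by pairing the ‘ID’ column with the model's predictions. The sketch below builds synthetic DataFrames with the same column layout as the files described above (an assumption; the real data would come from the CSV files) and uses a placeholder max_leaf_nodes value.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins with the column layout described in the handout.
X, y = make_classification(n_samples=530, n_features=25, n_informative=4,
                           n_redundant=0, random_state=0)
feature_names = [f"x_{i}" for i in range(1, 26)]

train = pd.DataFrame(X[:500], columns=feature_names)
train.insert(0, "y", y[:500])                  # first column: output variable
new = pd.DataFrame(X[500:], columns=feature_names)
new.insert(0, "ID", list(range(1, 31)))        # first column: identifier

# Placeholder hyper-parameter value, not the answer to Task 1.
model = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0)
model.fit(train[feature_names], train["y"])

# The ID column is kept out of the features and reattached to the predictions.
predictions = pd.DataFrame({
    "ID": new["ID"],
    "predicted y": model.predict(new[feature_names]),
})
print(predictions.to_string(index=False))
```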

Task 4. [2 Points]

Of the 25 input variables, which ones are relevant for this classification task?

The following … input variables are relevant for this classification task: …………………

Display your trained decision tree:
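One way to approach both parts is to inspect which features the fitted tree actually splits on and to print the tree as text. The sketch below uses synthetic data and a placeholder max_leaf_nodes value. Treating "nonzero impurity-based importance" as "relevant" is an interpretation, not something the handout states.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in (assumption). With shuffle=False the 4 informative
# features are the first 4 columns and the remaining columns are noise.
X, y = make_classification(n_samples=1000, n_features=25, n_informative=4,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = [f"x_{i}" for i in range(1, 26)]

# Placeholder hyper-parameter value, not the answer to Task 1.
model = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)

# Features with nonzero importance are the ones used in at least one split.
used = [(name, importance)
        for name, importance in zip(feature_names, model.feature_importances_)
        if importance > 0]
used.sort(key=lambda pair: pair[1], reverse=True)
for name, importance in used:
    print(f"{name}: {importance:.4f}")

# Text rendering of the tree; each root-to-leaf path reads as one rule.
rules = export_text(model, feature_names=feature_names)
print(rules)
```

For a graphical rendering, sklearn.tree.plot_tree draws the same tree with matplotlib.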

Question 2. Supervised machine learning classifiers [10 Points]

Data: The zip file “hw2.q2.data.zip” contains 3 CSV files:

· “hw2.q2.train.csv” contains 8,000 rows and 11 columns. The first column ‘y’ is the output variable with 4 classes: 0, 1, 2, 3. The remaining 10 columns contain input features: x1, …, x10.

· “hw2.q2.test.csv” contains 2,000 rows and 11 columns. The first column ‘y’ is the output variable with 4 classes: 0, 1, 2, 3. The remaining 10 columns contain input features: x1, …, x10.

· “hw2.q2.new.csv” contains 30 rows and 11 columns. The first column ‘ID’ is an identifier for 30 unlabeled samples. The remaining 10 columns contain input features: x1, …, x10.

Task 1. [6 points]

Use 4-fold cross-validation with the 8,000 labeled examples from “hw2.q2.train.csv” to identify a classifier that achieves a mean cross-validation accuracy of at least 0.96. You should try several Scikit-Learn classifiers, including: GaussianNB, DecisionTreeClassifier, RandomForestClassifier, ExtraTreesClassifier, KNeighborsClassifier, LogisticRegression, SVC, and MLPClassifier. Try different hyper-parameter values for the better-performing classifiers to obtain a good set of hyper-parameter values. Then select the best-performing model. Report the following:
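A first pass over the listed classifiers can be written as a loop over a dictionary of candidates, as in the sketch below. It runs on synthetic 4-class data. Wrapping the distance-based and gradient-based models in a StandardScaler pipeline is a design choice made here, not a requirement of the assignment, and all hyper-parameters are scikit-learn defaults apart from random_state and max_iter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class stand-in for the training file (assumption).
X, y = make_classification(n_samples=800, n_features=10, n_informative=6,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

candidates = {
    "GaussianNB": GaussianNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    # Scaling inside the pipeline is refit on each training fold, which
    # avoids leaking validation-fold statistics into the model.
    "KNeighborsClassifier": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "LogisticRegression": make_pipeline(StandardScaler(),
                                        LogisticRegression(max_iter=1000)),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "MLPClassifier": make_pipeline(StandardScaler(),
                                   MLPClassifier(max_iter=500, random_state=0)),
}

# Mean 4-fold cross-validation accuracy for each candidate.
scores = {name: cross_val_score(model, X, y, cv=4, scoring="accuracy").mean()
          for name, model in candidates.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {score:.4f}")

best_name = max(scores, key=scores.get)
```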

Selected model with hyper-parameter values:

Mean cross-validation accuracy: ………………………. (rounded to 4 decimal places)
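For the hyper-parameter search, GridSearchCV can run the same 4-fold cross-validation over a grid of values and report the best combination with its mean accuracy. The sketch below tunes an SVC on synthetic data. The choice of SVC and the grid values are illustrative assumptions, not a recommendation for the actual data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class stand-in for the training file (assumption).
X, y = make_classification(n_samples=800, n_features=10, n_informative=6,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# make_pipeline names each step after its lowercased class name, so the SVC
# parameters are addressed with the "svc__" prefix.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10],               # illustrative values
    "svc__gamma": ["scale", 0.01, 0.1],   # illustrative values
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=4, scoring="accuracy")
search.fit(X, y)

print("Best hyper-parameters:", search.best_params_)
print(f"Mean cross-validation accuracy: {search.best_score_:.4f}")
```

With the default refit=True, search.best_estimator_ is already retrained on all the data passed to fit, so it can be used directly for prediction.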

Task 2. [2 Points]

Train the classifier with the hyper-parameter values determined in Task 1 on all 8,000 training samples and use it to predict the output class ‘y’ for the 2,000 examples in “hw2.q2.test.csv”. Report the following:

· Accuracy on 2,000 test examples: …………………… (rounded to 4 decimal places)

· Classification report for the 2,000 test examples:

· Confusion matrix for the 2,000 test examples:

Task 3. [2 Points]

Use the model trained in Task 2 to predict the output class ‘y’ for the 30 examples in “hw2.q2.new.csv”. Specify the predicted classes in the table below:

ID	predicted y
ID_001
ID_002
ID_003
ID_004
ID_005
ID_006
ID_007
ID_008
ID_009
ID_010
ID_011
ID_012
ID_013
ID_014
ID_015
ID_016
ID_017
ID_018
ID_019
ID_020
ID_021
ID_022
ID_023
ID_024
ID_025
ID_026
ID_027
ID_028
ID_029
ID_030
