Data Mining - Computer Science
Problem 1
Consider the following training dataset with fifteen entries. Each entry records a participant's answers to a series of questions asking whether they like a certain type of food, where the participant answered 1 for yes or 0 for no. The last column (“midwest?”) is our target column: once the decision tree is built, this is the class we are trying to predict, i.e., whether a person is from the Midwest.
Create the entire decision tree for this dataset using information gain as the attribute selection measure. Provide the entropy and the information gain of every attribute at each partitioning step, and highlight the attribute and value that you chose to partition the dataset at each step, as I did in the example that I provided.
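The entropy and information gain calculations requested above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the assignment's solution: the rows and the attribute names `likes_pizza` and `likes_sushi` are hypothetical stand-ins, because the fifteen-entry dataset is not reproduced in this text, and the helper names `entropy` and `information_gain` are introduced here only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy rows (NOT the assignment's dataset): 1 = yes, 0 = no.
toy_rows = [
    {"likes_pizza": 1, "likes_sushi": 0, "midwest": 1},
    {"likes_pizza": 1, "likes_sushi": 1, "midwest": 1},
    {"likes_pizza": 0, "likes_sushi": 0, "midwest": 0},
    {"likes_pizza": 0, "likes_sushi": 1, "midwest": 0},
]

# Compute the gain of every candidate attribute and pick the largest.
gains = {a: information_gain(toy_rows, a, "midwest") for a in ("likes_pizza", "likes_sushi")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

At each partitioning step the same computation would be repeated on the subset of rows that reaches the current node, excluding attributes already used on the path from the root.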
Problem 2
Consider the dataset below.
| age | income | student | credit_rating | buys_computer |
|---|---|---|---|---|
| <=30 | high | no | fair | no |
| <=30 | high | no | excellent | no |
| 31…40 | high | no | fair | yes |
| >40 | medium | no | fair | yes |
| >40 | low | yes | fair | yes |
| >40 | low | yes | excellent | no |
| 31…40 | low | yes | excellent | yes |
| <=30 | medium | no | fair | no |
| <=30 | low | yes | fair | yes |
| >40 | medium | yes | fair | yes |
| <=30 | medium | yes | excellent | yes |
| 31…40 | medium | no | excellent | yes |
| 31…40 | high | yes | fair | yes |
| >40 | medium | no | excellent | no |
Classify the following data point using the naïve Bayesian method. Show all relevant calculations, similar to the example given in the slide.
X = (age > 40, income = medium, student = no, credit_rating = fair)
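The calculation can be sketched in Python as follows. This is a minimal illustration, assuming plain relative-frequency estimates with no Laplace smoothing (the slide's method is not shown here, so that choice is an assumption). The function name `naive_bayes_scores` is introduced only for this sketch, and exact fractions are used so each factor can be compared with a hand calculation.

```python
from fractions import Fraction

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def naive_bayes_scores(rows, query, target="buys_computer"):
    """Return P(class) * product of P(attribute = value | class) for each class."""
    t = COLUMNS.index(target)
    scores = {}
    for label in sorted({row[t] for row in rows}):
        in_class = [row for row in rows if row[t] == label]
        score = Fraction(len(in_class), len(rows))  # prior P(class)
        for attribute, value in query.items():
            a = COLUMNS.index(attribute)
            matches = sum(1 for row in in_class if row[a] == value)
            score *= Fraction(matches, len(in_class))  # P(value | class), unsmoothed
        scores[label] = score
    return scores


# The data point X from the problem statement.
query = {"age": ">40", "income": "medium", "student": "no", "credit_rating": "fair"}
scores = naive_bayes_scores(ROWS, query)
prediction = max(scores, key=scores.get)
for label, score in scores.items():
    print(label, score, float(score))
print("predicted buys_computer:", prediction)
```

The scores are unnormalized: the shared denominator P(X) is omitted because it does not change which class has the larger value.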
Problem 3
Consider the following confusion matrix:
| Actual Class \ Predicted Class | Cancer = yes | Cancer = no | Total |
|---|---|---|---|
| Cancer = yes | 90 | 210 | 300 |
| Cancer = no | 140 | 9560 | 9700 |
| Total | 230 | 9770 | 10000 |
Find the following:
a. Accuracy
b. Sensitivity
c. Specificity
d. Precision
e. Recall
f. F1 measure
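The measures listed above can be sketched in Python using their standard definitions. This is a minimal illustration that assumes "Cancer = yes" is the positive class, so the first row of the matrix supplies the true positives and false negatives. The variable names are introduced here only for illustration.

```python
# Cells of the confusion matrix, assuming "Cancer = yes" is the positive class.
tp, fn = 90, 210    # actual yes: predicted yes, predicted no
fp, tn = 140, 9560  # actual no:  predicted yes, predicted no
total = tp + fn + fp + tn

accuracy = (tp + tn) / total       # fraction of all cases classified correctly
sensitivity = tp / (tp + fn)       # true positive rate
specificity = tn / (tn + fp)       # true negative rate
precision = tp / (tp + fp)         # fraction of predicted positives that are correct
recall = sensitivity               # recall is another name for sensitivity
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

for name, value in [
    ("accuracy", accuracy),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("precision", precision),
    ("recall", recall),
    ("F1", f1),
]:
    print(f"{name}: {value:.4f}")
```

Because the negative class dominates the 10000 cases, accuracy and sensitivity can differ sharply on a matrix like this one, which is why the problem asks for all of the measures rather than accuracy alone.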