Data mining and neural networks using Python


Data Mining and Neural Networks

Computational Task 1

Task 1

a. What problem did the authors aim to solve?

The authors aimed to distinguish malignant from benign breast tumors, using nuclear size, shape, and texture as features.

b. Which methods did they use?

The authors used inductive machine learning and logistic regression to label each sample as malignant or benign.

c. How did they test the accuracy of classification?

The authors used cross-validation to test the accuracy of the predicted results. Logistic regression reached an accuracy of 96.2%, whereas inductive machine learning reached 97.5%.
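To illustrate the idea of cross-validation only (not the authors' exact protocol), the following minimal sketch splits sample indices into k folds so that every sample is used for testing exactly once. The function name `k_fold_indices` and the choice of 5 folds are assumptions made for this example; 569 is the number of cases in the data set used in Task 2.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the shuffled samples as evenly as possible across k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Hypothetical usage: 5 folds over 569 samples.
splits = list(k_fold_indices(n_samples=569, k=5))
```

A classifier would be fitted on each `train_idx` and scored on the matching `test_idx`, and the k scores averaged.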

Task 2

For Task 2, the data table was downloaded from ics.uci.edu as the file wdbc.data. It has 32 columns in total: 1 ID column, 1 Diagnosis column, and 30 attribute columns. The 30 attributes fall into 3 groups: the mean, standard error, and worst value of each measured feature. There are 212 malignant cases (M) and 357 benign cases (B), as shown in Figure 1.

Figure 1. Number of features and count of each target class

Figure 2 shows the mean, variance, and standard deviation of every attribute (columns 3-32). These are calculated before normalizing the attributes to unit variance.

Figure 2. Mean, Variance and Standard Deviation of each attribute (0-29)

Figure 3 shows the mean, variance, and standard deviation of all attributes for the malignant class (M).

Figure 3. Mean, Variance and Standard Deviation of each attribute for Malignant class (M)

Figure 4 shows the mean, variance, and standard deviation of all attributes for the benign class (B).

Figure 4. Mean, Variance and Standard Deviation of each attribute for Benign class (B)

The means, variances, and standard deviations show that the attributes are not normalized. To normalize, we subtract each attribute's mean from its values to get zero mean, then divide by the attribute's standard deviation to get unit variance, as shown in Figure 5.

Figure 5. Mean and standard deviation after normalization
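The normalization described above can be sketched for a single attribute using only the standard library. The helper name `standardize` and the toy values are assumptions for illustration; the sample standard deviation (n - 1 denominator) is used because that is what the pandas `.std()` call in the appendix computes by default.

```python
import statistics

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (sample) variance."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)  # sample standard deviation (n - 1)
    return [(x - mean) / std for x in column]

# Toy values, not taken from wdbc.data.
raw = [2.0, 4.0, 4.0, 5.0, 10.0]
z = standardize(raw)
```

After this step every attribute is on the same scale, which matters for distance-based methods such as the nearest-neighbour classifiers in Task 4.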

Task 3

To create predictors by one attribute, we plotted histograms for each attribute and each class. Some of the histograms are shown in Figure 6.

Figure 6. Histogram plots of first 4 columns

To calculate the optimal threshold for each single-attribute classifier, we swept the threshold from 0 to 20 (bins) and calculated the accuracy and specificity at each value. We chose the threshold that maximizes the accuracy. The resulting thresholds for each single-attribute classifier are shown in Figure 7.

Figure 7. Optimal Thresholds of all single attribute classifiers sorted by accuracy

From Figure 7, we can determine that attribute ‘20’ gives the best accuracy with the fewest classification errors.

The following are some of the classification rules:

Attribute   Accuracy   Error     Threshold   Classification Rule
20          89.99%     10.03%    16          If x <= 16 then Class B else Class M
0           89.39%     10.60%    15          If x <= 15 then Class B else Class M
12          80.63%     19.36%    3           If x > 3 then Class M else Class B

Table 1. Classification rules of the top 3 single attribute classifiers
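The threshold search behind rules of this kind can be sketched as follows. This is a simplified illustration that picks the threshold and rule direction with the highest plain accuracy; the appendix code ranks thresholds by its `specificity` score instead. The function name `best_threshold` and the toy data are assumptions, not values from wdbc.data.

```python
def best_threshold(values, labels, thresholds):
    """Return (accuracy, threshold, class_above) with the highest accuracy.

    The rule is: x > threshold -> class_above, otherwise the other class.
    """
    best = None
    for t in thresholds:
        # Try both directions of the rule for each candidate threshold.
        for class_above, class_below in (("M", "B"), ("B", "M")):
            correct = sum(
                (class_above if x > t else class_below) == y
                for x, y in zip(values, labels)
            )
            accuracy = correct / len(values)
            if best is None or accuracy > best[0]:
                best = (accuracy, t, class_above)
    return best

# Toy data (hypothetical values).
values = [10.0, 12.0, 13.0, 14.0, 17.0, 18.0, 20.0, 15.5]
labels = ["B", "B", "B", "B", "M", "M", "M", "B"]
accuracy, threshold, class_above = best_threshold(values, labels, range(0, 21))
```

On this toy data the sweep finds a rule of the same form as the first row of Table 1: values above the threshold are labelled M, the rest B.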

Task 4

To test the 1NN and 3NN classification rules, we normalized the values to zero mean and unit variance. We also divided the dataset into 60% training data and 40% test data to measure the classification accuracy and error. Figure 8 shows the accuracy of both the 1NN and 3NN classifiers.

Figure 8. Accuracy of 1NN and 3NN classifiers

Figure 9 shows the classification errors of both the 1NN and 3NN classifiers.

Figure 9. Classification errors of 1NN and 3NN classifiers

Based on this, the 3NN classifier is more accurate than the 1NN classifier, hence 3NN is better at classifying malignant vs. benign cancers.
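To make the difference between 1NN and 3NN concrete, here is a minimal from-scratch sketch of the k-nearest-neighbour rule with Euclidean distance and majority voting. It is an illustration only; the results above come from sklearn's `KNeighborsClassifier`. The function name `knn_predict` and the toy points are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy, already-normalized points (hypothetical values).
train_points = [(-1.0, -1.2), (-0.8, -0.9), (-1.1, -0.7),
                (1.0, 1.1), (0.9, 1.3), (0.2, 0.1)]
train_labels = ["B", "B", "B", "M", "M", "B"]
query = (0.4, 0.4)
pred_1nn = knn_predict(train_points, train_labels, query, k=1)
pred_3nn = knn_predict(train_points, train_labels, query, k=3)
```

Here the single nearest neighbour is a B point, but two of the three nearest are M, so the two rules disagree. Voting over three neighbours makes the prediction less sensitive to one isolated training point.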


Task 5

Fisher’s linear discriminant is used to obtain a hyperplane that optimizes the signal-to-noise ratio, that is, the hyperplane that maximizes the distance between the means of the projected instances and minimizes the variance among the projected instances of each class. In other words, it tries to find the hyperplane that maximizes the distance between the two groups of projected instances while keeping each group closely packed.

Figure 10. A hyperplane that divides all projections clearly

Here the projections of all data points on the hyperplane are well separated and the projections are also

closely packed. This allows us to take a normal to the hyperplane and classify.

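Fisher’s Linear Discriminant hence finds the hyperplane by maximizing the following ratio:

$$J(\vec{w}) = \frac{\left(\vec{w}^{\top}(\vec{\mu}_1 - \vec{\mu}_2)\right)^2}{\vec{w}^{\top}(\Sigma_1 + \Sigma_2)\,\vec{w}}$$

Here (w⃗) is normal to the hyperplane, $\vec{\mu}_1$ and $\vec{\mu}_2$ are the means of the two classes, and $\Sigma_1$ and $\Sigma_2$ are their covariance matrices.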
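For two classes, the direction that maximizes Fisher's ratio is known to be proportional to the inverse of the within-class scatter matrix applied to the difference of the class means. The sketch below computes it for 2-D points in pure Python. The helper names and toy points are assumptions, and the within-class scatter is taken as the sum of the two classes' scatter matrices (using covariances instead is a common variant).

```python
def mean_vector(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter_matrix(points, mean):
    """Entries (sxx, sxy, syy) of the symmetric 2x2 scatter matrix."""
    sxx = sum((p[0] - mean[0]) ** 2 for p in points)
    syy = sum((p[1] - mean[1]) ** 2 for p in points)
    sxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points)
    return sxx, sxy, syy

def fisher_direction(class_1, class_2):
    """Return w proportional to S_W^{-1} (m1 - m2) for 2-D data."""
    m1, m2 = mean_vector(class_1), mean_vector(class_2)
    a1, b1, d1 = scatter_matrix(class_1, m1)
    a2, b2, d2 = scatter_matrix(class_2, m2)
    a, b, d = a1 + a2, b1 + b2, d1 + d2  # S_W = [[a, b], [b, d]]
    det = a * d - b * b
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    # Closed-form inverse of a symmetric 2x2 matrix applied to (dx, dy).
    return ((d * dx - b * dy) / det, (a * dy - b * dx) / det)

# Toy points (hypothetical values).
class_1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class_2 = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0), (7.0, 6.0)]
w = fisher_direction(class_1, class_2)
proj_1 = [w[0] * x + w[1] * y for x, y in class_1]
proj_2 = [w[0] * x + w[1] * y for x, y in class_2]
```

Projecting each point onto `w` gives one number per sample. On this toy data the two classes' projections do not overlap, so a single threshold on the projected value separates them.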

Task 6

We applied Fisher’s linear discriminant to the Breast Cancer Wisconsin (Diagnostic) data set using sklearn’s LinearDiscriminantAnalysis classifier. Figure 11 shows the accuracy of Fisher’s classifier:

Figure 11. Accuracy of Fisher’s Linear Discriminant

Figure 12 shows the confusion matrix and classification errors:

Figure 12. Confusion matrix and classification errors of Fisher’s Classifier

Compared with 1NN, this method was more accurate, and it was on par with the accuracy of the 3NN method.
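As a small illustration of how a confusion matrix and an accuracy figure relate, the sketch below counts (actual, predicted) pairs by hand. Rows are actual classes and columns are predicted classes, which is the layout sklearn's `confusion_matrix(y_true, y_pred)` uses. The function name `confusion_counts` and the labels are hypothetical and are not the results reported above.

```python
from collections import Counter

def confusion_counts(actual, predicted, labels=("B", "M")):
    """Return a nested dict: counts[actual_label][predicted_label]."""
    pairs = Counter(zip(actual, predicted))
    return {a: {p: pairs[(a, p)] for p in labels} for a in labels}

# Hypothetical labels for illustration only.
actual    = ["B", "B", "M", "M", "B", "M", "B", "M"]
predicted = ["B", "B", "M", "B", "B", "M", "M", "M"]
counts = confusion_counts(actual, predicted)
# Accuracy is the share of samples on the diagonal of the matrix.
accuracy = (counts["B"]["B"] + counts["M"]["M"]) / len(actual)
```

The off-diagonal entries are the classification errors: `counts["M"]["B"]` counts malignant cases predicted as benign, and `counts["B"]["M"]` the reverse.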

Appendix

1. # Import statements

import pandas as pd

import numpy as np

from sklearn import preprocessing

from matplotlib import pyplot

2. # Data import

headers = ['ID', 'Diagnosis']

headers.extend([str(i) for i in range(30)])

data =  pd.read_csv('wdbc.data', sep=",", header=None, names=headers)

data

3. # Stats

attributes = data.shape[1] - 2 # remove id and class count
benign, malignant = 0, 0
for index, row in data.iterrows():
    # Read the label by column name; positional row[1] is deprecated in recent pandas
    diagnosis = row['Diagnosis']
    if diagnosis == 'M':
        malignant += 1
    elif diagnosis == 'B':
        benign += 1
    else:
        print(diagnosis)
print("There are {} attributes".format(attributes))
print("There are {} malignant cases (M) and {} benign cases (B)".format(malignant, benign))

4. # mean variance and standard deviation: All classes

all_means = []

all_std_deviations = []

all_variations = []

for column in data.columns[2:]:

    all_means.append(data[column].mean())

    all_std_deviations.append(data[column].std())

    all_variations.append(data[column].var())

pd.DataFrame({'Mean': all_means, 'Variance': all_variations, 'Standard Deviation': 

all_std_deviations})

5. # mean variance and standard deviation: Class: malignant

malignant_means = []

malignant_std_deviations = []

malignant_variations = []

for column in data.columns[2:]:

    condition = data['Diagnosis'] == 'M'

    # print(column)

    # print(data.columns[int(column)+2])

    filtered_data = data.loc[condition]

    malignant_means.append(filtered_data[column].mean())

    malignant_std_deviations.append(filtered_data[column].std())

    malignant_variations.append(filtered_data[column].var())

pd.DataFrame({'Malignant Mean': malignant_means, 'Malignant Variance': malignant_variations,
              'Malignant Standard Deviation': malignant_std_deviations})

6. # mean variance and standard deviation: Class: benign

benign_means = []

benign_std_deviations = []

benign_variations = []

for column in data.columns[2:]:

    condition = data['Diagnosis'] == 'B'

    filtered_data = data.loc[condition]

    benign_means.append(filtered_data[column].mean())

    benign_std_deviations.append(filtered_data[column].std())

    benign_variations.append(filtered_data[column].var())

pd.DataFrame({'Benign Mean': benign_means, 'Benign Variance': benign_variations,
              'Benign Standard Deviation': benign_std_deviations})

7. # Optimal thresholds for all attributes

column_specificity_map = {}

results = {}

for column in data.columns[2:]:

    # find min and max (names chosen so the built-ins min/max are not shadowed)
    num_bins = 20
    col_min = data[column].min()
    col_max = data[column].max()
    # get bins: num_bins equal-width bins need num_bins + 1 edges
    bins = np.linspace(col_min, col_max, num_bins + 1)
    # the `normed` argument no longer exists in NumPy; raw counts are the default
    class_m = np.histogram(data.loc[data.Diagnosis == 'M', column], bins=bins)[0]
    class_b = np.histogram(data.loc[data.Diagnosis == 'B', column], bins=bins)[0]
    total_class_m = sum(class_m)
    total_class_b = sum(class_b)
    # per-class relative frequencies, so the two histograms are comparable
    new_data_m = [ item/total_class_m for item in class_m ]
    new_data_b = [ item/total_class_b for item in class_b ]
    new_data = pd.DataFrame({'M': new_data_m, 'B': new_data_b})
    new_data.plot.bar(title="Column: " + column)
    pyplot.show()

    # find the optimal threshold
    values_m = data.loc[data.Diagnosis == 'M', column]
    values_b = data.loc[data.Diagnosis == 'B', column]
    threshold_specificity_map = {}
    for i in range(0,num_bins):
        # Rule 1: a <= threshold -> class M, a > threshold -> class B
        class_m_correct = (values_m <= i).sum()
        class_b_correct = (values_b > i).sum()
        norm_class_m_correct = class_m_correct/total_class_m
        norm_class_b_correct = class_b_correct/total_class_b
        accuracy_1 = (class_m_correct + class_b_correct) / (total_class_m + total_class_b)
        # "specificity" here is the mean of the two per-class correct rates
        specificity_1 = (norm_class_m_correct + norm_class_b_correct)/2
        # Rule 2: a <= threshold -> class B, a > threshold -> class M
        class_b_correct = (values_b <= i).sum()
        class_m_correct = (values_m > i).sum()
        norm_class_m_correct = class_m_correct/total_class_m
        norm_class_b_correct = class_b_correct/total_class_b
        accuracy_2 = (class_m_correct + class_b_correct) / (total_class_m + total_class_b)
        specificity_2 = (norm_class_m_correct + norm_class_b_correct)/2
        # keep whichever rule direction scores higher
        specificity = specificity_1
        accuracy = accuracy_1
        if specificity < specificity_2:
            specificity = specificity_2
            accuracy = accuracy_2
        threshold_specificity_map[i] = {'specificity': specificity, 'accuracy': accuracy}

    # Get the optimal threshold

    max_specificity = -100

    max_accuracy = -100

    optimal_threshold = -100

    for threshold, item in threshold_specificity_map.items():

        if item['specificity'] > max_specificity:

            max_specificity = item['specificity']

            max_accuracy = item['accuracy']

            optimal_threshold = threshold

    # print("Optimal Threshold: ", optimal_threshold)

    # print("Accuracy: ", max_accuracy)

    # print("Error: ", 1-max_accuracy)

    column_specificity_map[column] = max_specificity

    results[column] = {

        'Optimal Threshold': optimal_threshold,

        'Accuracy': max_accuracy,

        'Error': 1-max_accuracy

    }

# print in order of prediction ability (highest specificity score first)
dict(sorted(column_specificity_map.items(), key=lambda item: item[1], reverse=True))

pd.DataFrame(results).transpose().sort_values(by=['Accuracy'], ascending=False)

8.

from sklearn.model_selection import train_test_split

# Normalization to zero mean and unit variance

data.iloc[:,2:] = data.iloc[:,2:].apply(lambda x: (x-x.mean())/x.std())

train, test, train_labels, test_labels = train_test_split(
    data.iloc[:,2:], data.iloc[:,1], test_size=0.40, random_state=3)

9. # KNN

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=1)

model.fit(train, train_labels)

nn1_predictions = model.predict(test)

accuracy_1 = model.score(test, test_labels)

print("1NN: Accuracy: ", accuracy_1)

model = KNeighborsClassifier(n_neighbors=3)

model.fit(train, train_labels)

nn2_predictions = model.predict(test)

accuracy_2 = model.score(test, test_labels)

print("3NN: Accuracy: ", accuracy_2)

pd.DataFrame({'1NN Accuracy': accuracy_1, '3NN Accuracy' : accuracy_2}, index=[0]

)

10.

print("Classification errors of 1NN:")

print("Prediction\tActual")

for prediction, actual in zip(nn1_predictions, test_labels):

    if prediction != actual:

        print(prediction, "\t\t", actual)

print("Classification errors of 3NN:")

print("Prediction\tActual")

for prediction, actual in zip(nn2_predictions, test_labels):

    if prediction != actual:

        print(prediction, "\t\t",actual)

11. # Fishers Linear Discriminant

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.metrics import accuracy_score, confusion_matrix

fisher_classifier = LinearDiscriminantAnalysis()

fisher_classifier.fit(train, train_labels)

fisher_predictions = fisher_classifier.predict(test)

print("Fisher's Accuracy: ", accuracy_score(test_labels, fisher_predictions))

# Confusion Matrix

print("Confusion Matrix")

# sklearn expects (y_true, y_pred): rows are actual classes, columns are predictions
print(confusion_matrix(test_labels, fisher_predictions))

print("Classification errors of Fisher's:")

print("Prediction\tActual")

for prediction, actual in zip(fisher_predictions, test_labels):

    if prediction != actual:

        print(prediction, "\t\t", actual)
