Data mining and neural networks using Python
Data Mining and Neural Networks
Computational Task 1
Task 1
a. What problem did the authors aim to solve?
The authors aimed to distinguish malignant from benign breast tumors, using nuclear size, shape, and
texture as features.
b. Which methods did they use?
The authors used inductive machine learning and logistic regression to classify each sample as
malignant or benign.
c. How did they test the accuracy of classification?
The authors used cross-validation to test the accuracy of the predicted results. Logistic regression
reached an accuracy of 96.2%, whereas inductive machine learning reached 97.5%.
Task 2
For Task 2, the data table was downloaded from ics.uci.edu as the file wdbc.data. It has 32 columns in
total: 1 ID column, 1 Diagnosis column, and 30 attribute columns. The 30 attributes fall into 3 groups:
the mean, the standard error, and the worst value of each measured feature. There are 212 malignant
cases (M) and 357 benign cases (B), as shown in Figure 1.
Figure 1. Number of features and count of each target class
Figure 2 shows the mean, variance, and standard deviation of all attributes (columns 3 to 32). These
are calculated before normalizing the attributes to unit variance.
Figure 2. Mean, Variance and Standard Deviation of each attribute (0-29)
Figure 3 shows the mean, variance, and standard deviation of all attributes for the malignant class (M).
Figure 3. Mean, Variance and Standard Deviation of each attribute for Malignant class (M)
Figure 4 shows the mean, variance, and standard deviation of all attributes for the benign class (B).
Figure 4. Mean, Variance and Standard Deviation of each attribute for Benign class (B)
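As an illustration of the per-class statistics described above, the following sketch computes the mean, variance, and standard deviation of each attribute for each class in one step. It uses a small made-up table (the `toy` values are hypothetical, not taken from wdbc.data) and pandas `groupby`, which is an alternative to the explicit loops in the appendix rather than the code used for the figures.

```python
import pandas as pd

# Hypothetical stand-in for the WDBC table: one label column, two attribute columns.
toy = pd.DataFrame({
    "Diagnosis": ["M", "M", "B", "B"],
    "0": [20.0, 18.0, 12.0, 10.0],
    "1": [25.0, 21.0, 15.0, 17.0],
})

# Mean, variance and standard deviation of every attribute, per class.
# pandas uses the sample (n - 1) estimators for var() and std().
class_stats = toy.groupby("Diagnosis").agg(["mean", "var", "std"])
print(class_stats)
```

The result has one row per class and one (attribute, statistic) column pair per attribute, so `class_stats.loc["M"]` gives the statistics for the malignant class.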
The attributes are not normalized, as the means, variances, and standard deviations show. To
normalize, we subtract the mean of each attribute from each of its values to get zero mean, and then
divide by the standard deviation to get unit variance, as shown in Figure 5.
Figure 5. Mean and standard deviation after normalization
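The normalization step can be sketched as follows. The table `toy` is a hypothetical example, not the actual data; the expression itself mirrors the subtract-the-mean, divide-by-the-standard-deviation rule described above.

```python
import pandas as pd

# Hypothetical attribute table with two columns on very different scales.
toy = pd.DataFrame({"0": [1.0, 2.0, 3.0], "1": [10.0, 20.0, 60.0]})

# Column-wise z-score: subtract each column's mean, divide by its standard deviation.
normalized = (toy - toy.mean()) / toy.std()

print(normalized.mean())  # approximately 0 for every column
print(normalized.std())   # approximately 1 for every column
```

Because pandas aligns on column names, the same expression works unchanged for all 30 attribute columns.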
Task 3
To create predictors from a single attribute, we plotted histograms for each attribute and each class.
Some of these histograms are shown in Figure 6.
Figure 6. Histogram plots of first 4 columns
To calculate the optimal threshold for each single-attribute classifier, we varied the threshold over
0-20 (bins) and calculated the accuracy and specificity for each value. We chose the threshold that
maximizes the accuracy. The thresholds of each single-attribute classifier are shown in Figure 7.
Figure 7. Optimal Thresholds of all single attribute classifiers sorted by accuracy
From Figure 7, attribute ‘20’ gives the best accuracy and the fewest classification errors. The
classification rules of the top three attributes are:
Attribute  Accuracy  Error   Threshold  Classification Rule
20         89.99%    10.03%  16         If x <= 16 then Class B else Class M
0          89.39%    10.60%  15         If x <= 15 then Class B else Class M
12         80.63%    19.36%  3          If x > 3 then Class M else Class B
Table 1. Classification rules of the top 3 single attribute classifiers
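A single-attribute classifier of this kind can be sketched as a search over candidate thresholds, trying both directions of the rule and keeping the most accurate one. The function name `best_threshold` and the sample values are hypothetical; this is a simplified illustration of the idea, not the appendix code that produced Table 1.

```python
import numpy as np

def best_threshold(values, labels, thresholds):
    """Return (accuracy, threshold, rule) of the most accurate one-attribute rule."""
    values = np.asarray(values, dtype=float)
    is_m = np.asarray(labels) == "M"
    best = (0.0, None, None)
    for t in thresholds:
        # Rule 1: x <= t -> B, otherwise M
        acc_b_low = np.mean((values > t) == is_m)
        # Rule 2: x <= t -> M, otherwise B
        acc_m_low = np.mean((values <= t) == is_m)
        for acc, rule in ((acc_b_low, "x <= t -> B else M"),
                          (acc_m_low, "x <= t -> M else B")):
            if acc > best[0]:  # strict comparison keeps the first best threshold
                best = (float(acc), float(t), rule)
    return best

# Hypothetical attribute values: benign cases tend to be small, malignant ones large.
values = [9.0, 11.0, 12.0, 14.0, 17.0, 19.0, 21.0, 13.0]
labels = ["B", "B", "B", "B", "M", "M", "M", "M"]
accuracy, threshold, rule = best_threshold(values, labels, thresholds=range(0, 25))
print(accuracy, threshold, rule)
```

On this toy data one benign value (14) lies above a malignant value (13), so no threshold separates the classes perfectly and the best rule misclassifies one case.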
Task 4
To test the 1NN and 3NN classification rules, we normalized the values to zero mean and unit variance.
We also divided the dataset into 60% training data and 40% test data to measure the classification
accuracy and error. Figure 8 shows the accuracy of both the 1NN and 3NN classifiers.
Figure 8. Accuracy of 1NN and 3NN classifiers
Figure 9 shows the classification errors of both the 1NN and 3NN classifiers.
Figure 9. Classification errors of 1NN and 3NN classifiers
Based on these results, the 3NN classifier is more accurate than the 1NN classifier, and is therefore
better at classifying malignant versus benign cancers.
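To make the difference between the two rules concrete, here is a minimal from-scratch sketch of k-nearest-neighbour voting. It assumes Euclidean distance (the default metric of sklearn's KNeighborsClassifier) and uses hypothetical points; `knn_predict` is an illustrative helper, not part of the appendix code.

```python
from collections import Counter

import numpy as np

def knn_predict(train_x, train_y, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    train_x = np.asarray(train_x, dtype=float)
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_x - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: one malignant point sits inside the benign cluster.
train_x = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.45, 0.45]]
train_y = ["B", "B", "M", "M", "M"]
query = [0.4, 0.4]

print("1NN:", knn_predict(train_x, train_y, query, k=1))
print("3NN:", knn_predict(train_x, train_y, query, k=3))
```

With k=1 the query takes the label of the single isolated neighbour, while with k=3 the two nearby benign points outvote it. This smoothing effect is one plausible reason a 3NN rule can be less sensitive to noisy training points than a 1NN rule.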
Task 5
Fisher’s linear discriminant is used to obtain a hyperplane that optimizes the signal-to-noise ratio, that
is, the hyperplane that maximizes the distance between the means of the projected instances and
minimizes the variance among the projected instances of each class. In other words, it tries to find the
hyperplane that increases the distance between the two groups of projected instances while keeping
the instances within each group closely packed.
Figure 10. A hyperplane that divides all projections clearly
Here the projections of all data points on the hyperplane are well separated and the projections are also
closely packed. This allows us to take a normal to the hyperplane and classify.
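Fisher’s linear discriminant hence finds the hyperplane by maximizing the following ratio, written here
in its standard form:
J(\vec{w}) = \frac{\left( \vec{w} \cdot (\vec{\mu}_1 - \vec{\mu}_2) \right)^2}{\vec{w}^T (\Sigma_1 + \Sigma_2) \vec{w}}
Here \vec{w} is normal to the hyperplane, \vec{\mu}_1 and \vec{\mu}_2 are the class means, and \Sigma_1
and \Sigma_2 are the class covariance matrices.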
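For two classes, the direction that maximizes Fisher's ratio has a well-known closed form: it is proportional to the inverse of the summed class covariance matrices applied to the difference of the class means. The sketch below computes it with NumPy on hypothetical points; `fisher_direction` is an illustrative helper and is not the sklearn routine used in Task 6.

```python
import numpy as np

def fisher_direction(class_1, class_2):
    """Unit vector w proportional to (S1 + S2)^-1 (m1 - m2)."""
    class_1 = np.asarray(class_1, dtype=float)
    class_2 = np.asarray(class_2, dtype=float)
    mean_1, mean_2 = class_1.mean(axis=0), class_2.mean(axis=0)
    # Within-class scatter: sum of the two class covariance matrices.
    scatter = np.cov(class_1, rowvar=False) + np.cov(class_2, rowvar=False)
    w = np.linalg.solve(scatter, mean_1 - mean_2)
    return w / np.linalg.norm(w)

# Hypothetical two-dimensional samples for the two classes.
class_1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class_2 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]]

w = fisher_direction(class_1, class_2)
proj_1 = np.asarray(class_1) @ w  # projections of class 1 onto w
proj_2 = np.asarray(class_2) @ w  # projections of class 2 onto w
print(w, proj_1, proj_2)
```

On this toy data the two sets of projections do not overlap, so a single threshold on the projected value separates the classes.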
Task 6
We applied Fisher’s linear discriminant to the Breast Cancer Wisconsin (Diagnostic) data set using
sklearn’s LinearDiscriminantAnalysis classifier. Figure 11 shows the accuracy of Fisher’s classifier:
Figure 11. Accuracy of Fisher’s Linear Discriminant
Figure 12 shows the confusion matrix and classification errors:
Figure 12. Confusion matrix and classification errors of Fisher’s Classifier
Compared with 1NN, this method was more accurate, and it was on par with the accuracy of the 3NN
method.
Appendix
1. # Import statements
import pandas as pd
import numpy as np
from sklearn import preprocessing
from matplotlib import pyplot
2. # Data import
headers = ['ID', 'Diagnosis']
headers.extend([str(i) for i in range(30)])
data = pd.read_csv('wdbc.data', sep=",", header=None, names=headers)
data
3. # Stats
attributes = data.shape[1] - 2 # remove id and class count
benign, malignant = 0, 0
for index, row in data.iterrows():
    if row['Diagnosis'] == 'M':
        malignant += 1
    elif row['Diagnosis'] == 'B':
        benign += 1
    else:
        # report any unexpected label
        print(row['Diagnosis'])
print("There are {} attributes".format(attributes))
print("There are {} malignant cases (M) and {} benign cases (B)".format(malignant, benign))
4. # mean variance and standard deviation: All classes
all_means = []
all_std_deviations = []
all_variations = []
for column in data.columns[2:]:
    all_means.append(data[column].mean())
    all_std_deviations.append(data[column].std())
    all_variations.append(data[column].var())
pd.DataFrame({'Mean': all_means, 'Variance': all_variations,
              'Standard Deviation': all_std_deviations})
5. # mean variance and standard deviation: Class: malignant
malignant_means = []
malignant_std_deviations = []
malignant_variations = []
for column in data.columns[2:]:
    condition = data['Diagnosis'] == 'M'
    filtered_data = data.loc[condition]
    malignant_means.append(filtered_data[column].mean())
    malignant_std_deviations.append(filtered_data[column].std())
    malignant_variations.append(filtered_data[column].var())
pd.DataFrame({'Malignant Mean': malignant_means, 'Malignant Variance': malignant_variations,
              'Malignant Standard Deviation': malignant_std_deviations})
6. # mean variance and standard deviation: Class: benign
benign_means = []
benign_std_deviations = []
benign_variations = []
for column in data.columns[2:]:
    condition = data['Diagnosis'] == 'B'
    filtered_data = data.loc[condition]
    benign_means.append(filtered_data[column].mean())
    benign_std_deviations.append(filtered_data[column].std())
    benign_variations.append(filtered_data[column].var())
pd.DataFrame({'Benign Mean': benign_means, 'Benign Variance': benign_variations,
              'Benign Standard Deviation': benign_std_deviations})
7. # Optimal thresholds for all attributes
column_specificity_map = {}
results = {}
for column in data.columns[2:]:
    # find min, max and step (names chosen to avoid shadowing the built-ins min/max)
    num_bins = 20
    col_min = data[column].min()
    col_max = data[column].max()
    step = (col_max - col_min) / num_bins
    # get bin edges: num_bins edges give num_bins - 1 histogram bins,
    # so values above the last edge are not counted
    bins = [col_min]
    for i in range(1, num_bins):
        bins.append(bins[i-1] + step)
    # raw per-class counts in each bin (np.histogram returns counts by default)
    class_m = np.histogram(data.loc[data.Diagnosis == 'M', column], bins=bins)[0]
    class_b = np.histogram(data.loc[data.Diagnosis == 'B', column], bins=bins)[0]
    total_class_m = sum(class_m)
    total_class_b = sum(class_b)
    new_data_m = [ item/total_class_m for item in class_m ]
    new_data_b = [ item/total_class_b for item in class_b ]
    new_data = pd.DataFrame({'M': new_data_m, 'B': new_data_b})
    new_data.plot.bar(title="Column: " + column)
    pyplot.show()
    # find the optimal threshold
    # candidate thresholds are the integers 0..num_bins-1, compared with the raw values
    threshold_specificity_map = {}
    for i in range(0, num_bins):
        # a <= threshold -> class M
        # a > threshold -> class B
        class_m_correct = len([ item for item in data.loc[data.Diagnosis == 'M', column] if item <= i ])
        class_b_correct = len([ item for item in data.loc[data.Diagnosis == 'B', column] if item > i ])
        norm_class_m_correct = class_m_correct/total_class_m
        norm_class_b_correct = class_b_correct/total_class_b
        accuracy_1 = (class_m_correct + class_b_correct) / (total_class_m + total_class_b)
        # "specificity" here is the mean of the two per-class hit rates
        specificity_1 = (norm_class_m_correct + norm_class_b_correct)/2
        # a <= threshold -> class B
        # a > threshold -> class M
        class_b_correct = len([ item for item in data.loc[data.Diagnosis == 'B', column] if item <= i ])
        class_m_correct = len([ item for item in data.loc[data.Diagnosis == 'M', column] if item > i ])
        norm_class_m_correct = class_m_correct/total_class_m
        norm_class_b_correct = class_b_correct/total_class_b
        accuracy_2 = (class_m_correct + class_b_correct) / (total_class_m + total_class_b)
        specificity_2 = (norm_class_m_correct + norm_class_b_correct)/2
        # keep whichever rule direction has the higher specificity
        specificity = specificity_1
        accuracy = accuracy_1
        if specificity < specificity_2:
            specificity = specificity_2
            accuracy = accuracy_2
        threshold_specificity_map[i] = {'specificity': specificity, 'accuracy': accuracy}
    # Get the optimal threshold
    max_specificity = -100
    max_accuracy = -100
    optimal_threshold = -100
    for threshold, item in threshold_specificity_map.items():
        if item['specificity'] > max_specificity:
            max_specificity = item['specificity']
            max_accuracy = item['accuracy']
            optimal_threshold = threshold
    column_specificity_map[column] = max_specificity
    results[column] = {
        'Optimal Threshold': optimal_threshold,
        'Accuracy': max_accuracy,
        'Error': 1-max_accuracy
    }
# print in order of prediction ability
dict(sorted(column_specificity_map.items()))
pd.DataFrame(results).transpose().sort_values(by=['Accuracy'], ascending=False)
8.
from sklearn.model_selection import train_test_split
# Normalization to zero mean and unit variance
data.iloc[:,2:] = data.iloc[:,2:].apply(lambda x: (x-x.mean())/x.std())
train, test, train_labels, test_labels = train_test_split(
    data.iloc[:,2:], data.iloc[:,1], test_size=0.40, random_state=3)
9. # KNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
model.fit(train, train_labels)
nn1_predictions = model.predict(test)
accuracy_1 = model.score(test, test_labels)
print("1NN: Accuracy: ", accuracy_1)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(train, train_labels)
nn2_predictions = model.predict(test)
accuracy_2 = model.score(test, test_labels)
print("3NN: Accuracy: ", accuracy_2)
pd.DataFrame({'1NN Accuracy': accuracy_1, '3NN Accuracy': accuracy_2}, index=[0])
10.
print("Classification errors of 1NN:")
print("Prediction\tActual")
for prediction, actual in zip(nn1_predictions, test_labels):
    if prediction != actual:
        print(prediction, "\t\t", actual)
print("Classification errors of 3NN:")
print("Prediction\tActual")
for prediction, actual in zip(nn2_predictions, test_labels):
    if prediction != actual:
        print(prediction, "\t\t", actual)
11. # Fishers Linear Discriminant
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
fisher_classifier = LinearDiscriminantAnalysis()
fisher_classifier.fit(train, train_labels)
fisher_predictions = fisher_classifier.predict(test)
print("Fisher's Accuracy: ", accuracy_score(test_labels, fisher_predictions))
# Confusion Matrix
# Predictions are passed first, so rows are predicted labels and columns are actual labels
print("Confusion Matrix")
print(confusion_matrix(fisher_predictions, test_labels))
print("Classification errors of Fisher's:")
print("Prediction\tActual")
for prediction, actual in zip(fisher_predictions, test_labels):
    if prediction != actual:
        print(prediction, "\t\t", actual)