Application of Data Mining Classification Algorithms for Breast Cancer Diagnosis

Hajar Saoud 1

1 LIST Laboratory, UAE

Tangier, Morocco

[email protected]

Mohamed Ghailani 2

2 LabTIC Laboratory, UAE

Tangier, Morocco

[email protected]

Abderrahim Ghadi 1

1 LIST Laboratory, UAE

Tangier, Morocco

[email protected]

ABSTRACT
Breast cancer is one of the diseases with the highest incidence and mortality in the world. Data mining classification techniques are effective tools for classifying cancer data and supporting decision-making.

The objective of this paper is to compare the performance of different machine learning algorithms in the diagnosis of breast cancer, that is, in determining whether a tumor is benign or malignant.

Six machine learning algorithms were evaluated in this research: Bayes Network (BN), Support Vector Machine (SVM), k-nearest neighbors algorithm (Knn), Artificial Neural Network (ANN), Decision Tree (C4.5) and Logistic Regression. The algorithms are run with the WEKA tool (the Waikato Environment for Knowledge Analysis) on the Wisconsin breast cancer dataset available in the UCI machine learning repository.

Keywords
Breast cancer; diagnosis; machine learning algorithms; classification; WEKA

1. INTRODUCTION
Breast cancer is a disease in which cancer cells form in the tissue of the breast and can spread to other organs of the body. It is the second leading cause of cancer death for women after lung cancer [1]. Early diagnosis can reduce the breast cancer mortality rate by 40% [2].

Data mining and machine learning algorithms have become valuable tools in the field of health because they can process and analyze massive data to extract information that is useful for decision-making. They are therefore effective solutions for predicting and diagnosing breast cancer and for classifying a tumor into one of two categories, benign or malignant.

In this paper we examine the performance and accuracy of six machine learning algorithms in the classification and diagnosis of breast cancer: Bayes Network (BN), Support Vector Machine (SVM), k-Nearest Neighbors Algorithm (Knn), Artificial Neural Network (ANN), Decision Tree (C4.5) and Logistic Regression, on the Wisconsin breast cancer dataset using the WEKA environment.

The rest of this paper is structured as follows. Part two presents breast cancer. Part three reviews similar research. Parts four and five give a theoretical presentation of data mining and machine learning algorithms. Part six describes the dataset used. Part seven shows the experiments performed with the WEKA software on the Wisconsin breast cancer dataset. The results of these experiments are presented in part eight, and the conclusion and perspectives in part nine.

2. BREAST CANCER
Breast cancer is an abnormal proliferation of cells in the breast that grow in an uncontrolled way. The masses of cells formed in the breast are called tumors. The cancer cells can stay in the breast or spread to other organs of the body; this spread is called metastasis.

2.1 Types of breast cancer
Breast tumors are divided into two types, benign and malignant:

• Benign tumors are not dangerous and have well-defined contours. They develop slowly in the organ where they appeared without producing metastases. Benign tumors are composed of cells that resemble the normal cells of the breast tissue.

• Malignant tumors are dangerous because they can spread to other organs of the body and produce metastases. The cells of malignant tumors show several abnormalities compared with normal cells in shape, size and contours; they lose their original characteristics.

2.2 Causes of breast cancer
The first risk factor for breast cancer is age: the risk of breast cancer increases with age. Other factors can also intervene:

2.2.1 Family or genetic factors
• Gender: women are the most affected by breast cancer.

• Personal history: a woman who has already had breast cancer in one breast has an increased risk of developing cancer in the other breast.

Permission to make digital or hard copies of all or part of this work for personal or

classroom use is granted without fee provided that copies are not made or

distributed for profit or commercial advantage and that copies bear this notice and

the full citation on the first page. Copyrights for components of this work owned

by others than ACM must be honored. Abstracting with credit is permitted. To

copy otherwise, or republish, to post on servers or to redistribute to lists, requires

prior specific permission and/or a fee. Request permissions from

[email protected].

SCA '18, October 10–11, 2018, Tetouan, Morocco

© 2018 Association for Computing Machinery.

ACM ISBN 978-1-4503-6562-8/18/10…$15.00

https://doi.org/10.1145/3286606.3286861

Boudhir Anouar Abdelhakim 1

1 LIST Laboratory, UAE

Tangier, Morocco

[email protected]

SCA’2018, October 10-11, 2018, Tetouan, Morocco H.SAOUD et al.

• Family history: if several relatives of the woman have been diagnosed with breast cancer, especially at a young age, her risk of developing breast cancer increases.

• Genetic factors: some genetic mutations increase the risk of breast cancer.

2.2.2 Characteristics of the individual
• Obesity: obesity increases the risk of breast cancer.

• Early first period: having a first period before the age of 12 increases the risk of breast cancer.

• Late menopause: a woman who starts menopause at a later age is more likely to develop breast cancer.

• Having a first child at a late age: women who give birth to their first child after the age of 30 may have an increased risk of breast cancer.

• Never having been pregnant: not having had a child increases the risk of developing breast cancer.

• Hormone replacement therapy (estrogen and progesterone): increases the risk of breast cancer after 5 years of treatment.

• Drinking alcohol: drinking alcohol increases the risk of breast cancer.

3. RELATED WORKS
Several studies in this field have used data mining and machine learning algorithms to classify and diagnose different types of cancer.

Hiba Asri et al. [3] compared the performance of four classifiers for breast cancer prediction: Support Vector Machine (SVM), k-nearest neighbors algorithm (Knn), Decision Tree (C4.5) and Naive Bayes (NB), on the Wisconsin breast cancer dataset with the WEKA platform. The classifier that produced the highest accuracy was SVM, with an accuracy of 97.13%.

Akinsola Adeniyi F et al. [4] evaluated three classification algorithms, C4.5, Multi-layer Perceptron and Naive Bayes, using the WEKA environment. The best algorithm was C4.5, with the highest accuracy of 93.9854% in breast cancer classification.

K. Rajesh and S. Anand [5] classified the SEER breast cancer database into two categories, "Carcinoma in situ" and "Malignant potential", using C4.5. They obtained an accuracy of 94% in the training phase and 93% in the test phase.

Dana Bazazeh and Raed Shubair [6] studied and compared three machine learning algorithms, Random Forest (RF), Support Vector Machine (SVM) and Bayesian Networks (BN), on the Wisconsin Breast Cancer dataset. The results showed that SVM has the best performance in accuracy, specificity and precision, while Random Forest (RF) has the highest probability of classifying tumors correctly.

Chintan Shah and Anjali G. Jivani [7] examined three machine learning algorithms, Decision Tree, Bayesian Network and K-Nearest Neighbor, for predicting breast cancer. Their criteria were the lowest computation time and the accuracy. The algorithm that produced the best result was Naïve Bayes, with an accuracy of 95.9943% and a computation time of 0.02.

Zahra Nematzadeh et al. [8] compared four machine learning algorithms, Decision Tree, Naive Bayes, Neural Network and Support Vector Machine, on the original and prognostic Wisconsin breast cancer databases. They studied the impact of k in k-fold cross-validation on the accuracy obtained.

S. Kharya et al. [9] studied and compared theoretically several techniques, including Support Vector Machine, Bayesian belief network and Artificial Neural Network, for the prediction of breast cancer. They concluded that the Bayesian belief network is the best technique for predicting breast cancer.

4. DATA MINING
Data mining [10] means the extraction of knowledge and information from massive data coming from different sources. It can also be defined as one step of the Knowledge Discovery from Data (KDD) process, which can be presented in the following way:

Figure 1: Process of Knowledge Discovery from Data (KDD) [10].

Where:

• Data cleaning: this task aims to remove noise from the data.

• Data integration: this task combines data from different sources.

• Data selection: this task aims to extract the data relevant to the analysis task.

• Data transformation: this phase performs summary and aggregation operations to transform the data into forms appropriate for the mining task.

• Data mining: the essential phase, which aims to extract knowledge by applying data mining techniques.

• Pattern evaluation: this phase aims to identify the interesting patterns that represent knowledge, based on certain measures of interest.
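The preprocessing steps of the KDD process can be illustrated with a minimal Python sketch. It is not the preprocessing used in this paper: the records, the field names and the use of "?" to mark a missing value are assumptions made up for the example.

```python
# Toy records from two hypothetical sources; "?" marks a missing value.
source_a = [{"id": 1, "clump": "5", "size": "1", "label": "benign"},
            {"id": 2, "clump": "?", "size": "4", "label": "malignant"}]
source_b = [{"id": 3, "clump": "8", "size": "10", "label": "malignant"}]

def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [row for source in sources for row in source]

def clean(rows):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in rows if "?" not in r.values()]

def select(rows, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in rows]

def transform(rows, numeric_fields):
    """Data transformation: convert numeric fields from text to integers."""
    return [{k: int(v) if k in numeric_fields else v for k, v in r.items()}
            for r in rows]

prepared = transform(select(clean(integrate(source_a, source_b)),
                            ["clump", "size", "label"]),
                     {"clump", "size"})
print(prepared)
```

The record with the missing value is dropped and the remaining records are reduced to the selected fields, ready for the mining step.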

5. MACHINE LEARNING ALGORITHMS

5.1 Bayes Network
Bayes Networks [11], also called Bayesian belief networks, are widely used for modeling and representing knowledge in uncertain domains. A Bayes Network is a directed acyclic graph (DAG) whose nodes represent variables and whose arcs represent the probabilistic dependencies between those variables. Figure 2 shows an example of a simple Bayes Network.

Figure 2: Simple structure of a Bayes Network [11].

Each variable is conditionally independent of its non-descendants given its parent nodes. A Bayes Network therefore calculates the joint probability of a set of variables {Y1, Y2, ..., Yn} by decomposing it into a product of conditional probability distributions, one for each variable given its parents in the graph. The joint probability of all the nodes can be written in the following way:

$$P(Y_1, Y_2, \ldots, Y_n) = \prod_{i=1}^{n} P(Y_i \mid \mathrm{Parents}(Y_i))$$

where $\mathrm{Parents}(Y_i)$ represents the set of parent variables of $Y_i$.
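The factorization of the joint probability into a product of each variable's conditional probability given its parents can be sketched as follows. The three-node structure and all probability values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import product

# Hypothetical network: Y1 is the parent of both Y2 and Y3.
parents = {"Y1": (), "Y2": ("Y1",), "Y3": ("Y1",)}

# cpt[node][parent values] = P(node = True | parents); made-up numbers.
cpt = {
    "Y1": {(): 0.3},
    "Y2": {(True,): 0.9, (False,): 0.2},
    "Y3": {(True,): 0.7, (False,): 0.1},
}

def joint_probability(assignment):
    """P(Y1..Yn) as the product of P(Yi | Parents(Yi)) over all nodes."""
    prob = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[p] for p in node_parents)
        p_true = cpt[node][parent_values]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob

# P(Y1=True, Y2=True, Y3=False) = 0.3 * 0.9 * (1 - 0.7)
p = joint_probability({"Y1": True, "Y2": True, "Y3": False})

# Sanity check: the joint probabilities of all assignments sum to one.
total = sum(joint_probability(dict(zip(parents, values)))
            for values in product([True, False], repeat=len(parents)))
print(p, total)
```

Only three small tables are needed instead of a full table over all eight joint assignments, which is the practical benefit of the decomposition.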

5.2 Support Vector Machines (SVM)
Support Vector Machines [10] are supervised learning models that can be used for prediction and for the classification of both linear and nonlinear data. The principle of the SVM algorithm is to use a nonlinear mapping to transform the original training data into a higher dimension. In this new dimension, it searches for the optimal linear separating hyperplane. The SVM algorithm looks for the hyperplane with the largest margin, named the maximum marginal hyperplane (MMH), in order to classify future data more accurately. The margin is the shortest distance between the hyperplane and the closest training tuple of either class.

Figure 3: The two possible separation hyperplanes and their

associated margins [10].
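The idea of comparing separating hyperplanes by their margins can be illustrated with a short sketch. It does not train an SVM: the four training tuples and the two candidate hyperplanes are hypothetical, and the code only measures the margin of each candidate and keeps the larger one.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / math.hypot(*w)

def margin(w, b, points):
    """Shortest distance from the hyperplane to any training point."""
    return min(distance_to_hyperplane(w, b, x) for x in points)

# Hypothetical 2-D training tuples: the first two belong to one class,
# the last two to the other.
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0)]

# Two candidate hyperplanes that both separate the two classes.
candidates = {"A": ((1.0, 1.0), -5.0),   # x1 + x2 - 5 = 0
              "B": ((1.0, 0.0), -3.5)}   # x1 - 3.5 = 0

margins = {name: margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Both candidates classify the training tuples correctly, but candidate A leaves more room on either side, which is why a maximum-margin criterion would prefer it.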

5.3 k-nearest neighbors algorithm (Knn)
The k-nearest neighbors classifiers [10] are based on learning by analogy, i.e. they compare a given test tuple with training tuples that are similar to it. They classify a tuple using more than one nearest neighbor: the classifier searches the pattern space for the k training tuples closest to the unknown tuple. These tuples are the k nearest neighbors of the unknown tuple.

To find the k training tuples closest to the test tuple, it is necessary to calculate and compare distances. Several distance measures exist; the Euclidean distance is one of them. The Euclidean distance between two points or tuples, X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is:

$$dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

The most appropriate value of k is determined experimentally, starting with k = 1 and calculating the classification error rate for each value. The appropriate k is the one that gives the minimal error.
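The procedure just described (Euclidean distance, majority vote among the k closest training tuples, and choice of k by error rate) can be sketched as follows. The tuples are made up for illustration and are not taken from the Wisconsin dataset; the candidate values of k are likewise an assumption.

```python
import math
from collections import Counter

def euclidean(x1, x2):
    """Euclidean distance between two tuples of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_predict(train, query, k):
    """Majority class among the k training tuples closest to query."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def error_rate(train, validation, k):
    """Fraction of validation tuples misclassified with a given k."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in validation)
    return wrong / len(validation)

# Made-up (features, class) tuples, for illustration only.
train = [((1, 1), "benign"), ((2, 1), "benign"), ((1, 2), "benign"),
         ((8, 9), "malignant"), ((9, 8), "malignant"), ((3, 3), "malignant")]
validation = [((3, 2), "benign"), ((9, 9), "malignant")]

# Try several odd values of k and keep the one with the smallest error.
errors = {k: error_rate(train, validation, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

With k = 1 the tuple (3, 2) is misclassified because its single nearest neighbor happens to be malignant; using more neighbors smooths out that outlier.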

5.4 Decision Tree (C4.5)
The decision tree algorithm [12] is a classification algorithm whose model resembles a flowchart: the internal (non-leaf) nodes of a decision tree represent tests on attributes, the branches represent the outcomes of the tests and the external nodes (leaves) represent the predicted classes. At each node, the algorithm chooses the attribute that best partitions the data into individual classes. The tree is built by repeatedly selecting subsets of the data provided.

Figure 4: Example of a decision tree [12].

The figure shows an example of a decision tree that predicts whether a customer will buy a computer. Each internal (non-leaf) node represents a test on an attribute. Each leaf node represents a class (either buys a computer = yes, or buys a computer = no).

There are two ways to build a decision tree, from top to bottom and from bottom to top. There are several top-down decision tree algorithms, for example CART, C4.5 and ID3.
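One common way to choose the "best" attribute at a node is information gain, the criterion used by ID3 (C4.5 normalizes it into a gain ratio). The sketch below computes information gain on a few made-up customer records in the spirit of the computer-purchase example; the attribute names and values are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Made-up customer records, for illustration only.
rows = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
]
gains = {a: information_gain(rows, a, "buys") for a in ("student", "credit")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

In this toy data the attribute "student" separates the two classes perfectly, so it would be chosen as the test at the root node.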

5.5 Logistic regression
Logistic regression [13] is a generalized linear model that is widely used in machine learning. It predicts the probability of an outcome that can take two values from a set of predictor variables. Logistic regression is mainly used for prediction and to calculate the probability of success.
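As a minimal sketch of how a logistic model turns predictor values into a probability, the code below applies the logistic function to a weighted sum of the predictors. The coefficients are hypothetical and are not fitted to any data; the 0.5 decision threshold is also an assumption.

```python
import math

def predict_probability(weights, bias, features):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(w.x + b)))."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two predictors; not fitted to any data.
weights, bias = [0.8, 0.5], -6.0

p_low = predict_probability(weights, bias, [1, 1])    # small attribute values
p_high = predict_probability(weights, bias, [8, 9])   # large attribute values
label = "malignant" if p_high >= 0.5 else "benign"
print(round(p_low, 3), round(p_high, 3), label)
```

The output is always between 0 and 1, so it can be read directly as the probability of one of the two outcomes.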

5.6 Artificial Neural Network (ANN)
A neural network [14] can be defined as a reasoning model based on the human brain. An Artificial Neural Network is a set of very simple, highly interconnected processors (neurons), similar to the biological neurons of the human brain. The neurons are linked by weighted connections that pass signals from one neuron to another, and they operate in parallel. An Artificial Neural Network consists of an input layer, an output layer and, between them, additional layers called hidden layers.

Figure 5: Architecture of an Artificial Neural Network [14].
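The way signals pass through weighted connections from the input layer to the output layer can be sketched as a forward pass. The network size, the weights and the choice of a sigmoid activation are assumptions made for the example, not the configuration used in the experiments.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: each neuron applies sigmoid(w.x + b)."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

def forward(inputs, network):
    """Pass the signal from the input layer through every later layer."""
    activations = inputs
    for weights, biases in network:
        activations = layer(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
network = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1]),   # hidden layer
    ([[1.2, -0.7]], [0.05]),                    # output layer
]
output = forward([1.0, 0.0], network)
print(output)
```

Training consists of adjusting the weights so that the output matches the known class; that step is omitted here.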

6. DESCRIPTION OF THE DATASET
The database that we used in this research is the Wisconsin breast cancer dataset available in the UCI machine learning repository [15]. It contains 699 records (458 benign tumors and 241 malignant tumors). It is composed of 11 attributes: an identifier, nine predictor attributes and one class attribute that shows whether the tumor is benign or malignant. The predictor attributes take integer values between 1 and 10, where 1 corresponds to the state closest to normal and 10 to the most abnormal state.

The table below presents the description of the 11 attributes of the Wisconsin breast cancer dataset:

Attribute                     Description
Id                            A code identifying each record.
Clump Thickness               Benign cells are grouped in monolayers, whereas cancer cells are grouped in multilayers.
Uniformity of Cell Size       The size of the cancer cells.
Uniformity of Cell Shape      The shape of the cancer cells.
Marginal Adhesion             Cancer cells can lose their adhesion to one another; this is a sign of malignancy.
Single Epithelial Cell Size   The size of a single epithelial cell.
Bare Nuclei                   The nuclei are not surrounded by the rest of the cell in benign tumors.
Bland Chromatin               Cancer cells have coarse chromatin.
Normal Nucleoli               In cancer cells the nucleoli become prominent, whereas in normal cells the nucleoli are small.
Mitoses                       Cell growth.
Class                         Whether the tumor is benign or malignant.

Table 1: The 11 attributes of the breast cancer database.

7. EXPERIMENTATION
The platform that we used to apply the machine learning algorithms to the breast cancer database is WEKA [16], a collection of open source machine learning algorithms for carrying out data mining tasks on real-world problems. It contains tools for data preprocessing, classification, regression, clustering and association rules. It also offers an environment for developing new models.

The figures below show the performance of the machine learning algorithms in the WEKA environment:

7.1 Bayes Network

Figure 6: Results of running Bayes Network (BN) on the

database.

7.2 Support Vector Machines (SVM)

Figure 7: Results of running Support Vector Machine (SVM)

on the database.

7.3 k-nearest neighbors algorithm (Knn)

Figure 8: Results of the execution of k-nearest neighbors

algorithm (Knn) on the database.

7.4 Decision Tree (C4.5)

Figure 9: Results of the execution of Decision Tree (C4.5) on

the database.

7.5 Logistic regression

Figure 10: Results of running Logistic regression on the

database.

7.6 Artificial Neural Network (ANN)

Figure 11: Results of running the Artificial Neural Network

(ANN) on the database.

8. RESULTS AND EVALUATION

8.1 K-fold Cross-validation
To evaluate the performance of the machine learning algorithms on the breast cancer data we used the k-fold cross-validation method. This method divides the database into k subsets (folds): each fold is used once as test data to evaluate the performance of the model, while the remaining k - 1 folds serve as training data to build the model, and the results are averaged over the k runs. It is among the most widely used methods for evaluating machine learning techniques.
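The splitting performed by k-fold cross-validation can be sketched as follows. The value k = 10 is an assumption made for the example, since the number of folds is not stated here, and the sketch splits the records in their original order, whereas practical tools usually shuffle or stratify them first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 699 is the number of records in the dataset; k = 10 is an assumed value.
folds = list(k_fold_indices(699, 10))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

Every record appears in exactly one test fold, so each record is used for testing once and for training k - 1 times.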

8.2 Evaluation
We evaluate the performance of the compared methods in terms of accuracy, time taken, correctly classified instances and incorrectly classified instances:

Methods     Accuracy    Well classified instances   Wrong classified instances   Time taken
BN          97.2818 %   680                         19                           0.04
SVM         97.2818 %   680                         19                           0.08
LR          96.5665 %   675                         24                           0.04
ANN         95.422 %    667                         32                           1.42
KNN         95.279 %    666                         33                           0
DT (C4.5)   95.1359 %   665                         34                           0.03

Table 2: Evaluation of the performance of the methods
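The accuracy column of Table 2 follows directly from the instance counts: it is the share of well classified instances among all 699 records. The short check below recomputes it from the counts reported in the table; only the variable and function names are ours.

```python
TOTAL = 699  # number of records in the dataset

# (well classified, wrong classified) instances as reported in Table 2.
results = {"BN": (680, 19), "SVM": (680, 19), "LR": (675, 24),
           "ANN": (667, 32), "KNN": (666, 33), "DT (C4.5)": (665, 34)}

def accuracy_percent(correct, wrong):
    """Accuracy as the share of well classified instances, in percent."""
    return 100.0 * correct / (correct + wrong)

accuracies = {name: accuracy_percent(c, w) for name, (c, w) in results.items()}
for name, value in accuracies.items():
    print(f"{name}: {value:.4f} %")
```

For example, for Bayes Network and SVM, 680 / 699 gives the 97.2818 % shown in the table.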

Methods     Well classified benign   Well classified malignant   Precision for benign   Precision for malignant   Average
BN          442                      238                         99.3%                  93.7%                     97.4%
SVM         443                      237                         99.1%                  94%                       97.4%
LR          446                      229                         97.4%                  95%                       96.6%
ANN         442                      225                         96.5%                  93.4%                     95.4%
KNN         445                      221                         95.7%                  94.4%                     95.3%
DT (C4.5)   438                      227                         96.9%                  91.9%                     95.2%

Table 3: Evaluation of the performance of the methods for each type of cancer

We also evaluate:

• Kappa statistic: measures the agreement between two raters who assign items to categories, here the predicted and the actual classes.

• Mean absolute error: the average of the absolute differences between prediction and actual observation.

• Mean squared error: the average of the squared differences between prediction and actual observation.

• Root relative squared error.

• Relative absolute error.

Methods   Kappa Statistic   Mean Absolute Error   Mean Squared Error   Root relative squared error   Relative Absolute Error
BN        0.9406            0.0289                0.1623               34.1484 %                     6.3954 %
SVM       0.9405            0.0272                0.1649               34.6872 %                     6.0141 %
LR        0.924             0.0486                0.1667               35.0808 %                     10.7588 %
ANN       0.8987            0.0497                0.2031               42.7227 %                     11.0061 %
KNN       0.8948            0.0473                0.2128               44.7723 %                     10.4642 %
DT        0.893             0.0637                0.2142               45.0739 %                     14.1031 %

Table 4: Evaluation of the errors of the methods
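As an illustration of the Kappa statistic, the sketch below computes Cohen's kappa from a two-class table of actual versus predicted classes. The counts for Bayes Network are derived from the class sizes of the dataset (458 benign, 241 malignant) and the well classified instances of Table 3; we assume that this is the same kappa definition that WEKA reports.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows = actual class)."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Agreement expected by chance, from the row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

# Bayes Network: 442 of 458 benign and 238 of 241 malignant cases correct.
bn_matrix = [[442, 458 - 442],
             [241 - 238, 238]]
kappa_bn = cohen_kappa(bn_matrix)
print(round(kappa_bn, 4))
```

The value obtained this way agrees with the 0.9406 reported for Bayes Network in Table 4.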

8.3 Confusion matrix
The confusion matrix contains information about the actual and the predicted classifications:

                   predicted benign       predicted malignant
actual benign      TP (True Positives)    FN (False Negatives)
actual malignant   FP (False Positives)   TN (True Negatives)

Table 5: Confusion matrix

TP: the cases predicted as benign tumors that are in fact benign tumors.

TN: the cases predicted as malignant tumors that are in fact malignant tumors.

FP: the cases predicted as benign tumors that are in reality malignant tumors.

FN: the cases predicted as malignant tumors that are in reality benign tumors.

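From the confusion matrix we can calculate:

• $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

• $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$

• $\mathrm{Specificity} = \frac{TN}{TN + FP}$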

The diagram in Figure 12 shows the accuracy, sensitivity and

specificity of the classifiers.

Figure 12: Comparison of the Accuracy, Sensitivity and

Specificity.
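The three measures can be computed directly from the four cells of the confusion matrix, using the standard definitions. The sketch below does so for Bayes Network, with benign as the positive class as in Table 5; the cell counts are derived from the class sizes of the dataset and the well classified instances of Table 3.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from the four matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of benign cases recognized
        "specificity": tn / (tn + fp),   # share of malignant cases recognized
    }

# Bayes Network: 442 of 458 benign and 238 of 241 malignant cases
# are classified correctly.
bn = metrics(tp=442, fn=458 - 442, fp=241 - 238, tn=238)
for name, value in bn.items():
    print(f"{name}: {value:.4f}")
```

Because benign is treated as the positive class, specificity here is the measure that reflects how many malignant tumors are detected.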

9. CONCLUSION
In this paper we examined six machine learning techniques for the diagnosis of breast cancer, using the Wisconsin breast cancer dataset and the WEKA environment. The algorithms that gave the highest accuracy are Bayes Network and Support Vector Machine (SVM), with an accuracy of 97.2818%. Of the two, the Bayes Network classifier is the better choice for the diagnosis and classification of breast cancer because it took less time than the SVM (0.04 versus 0.08). In future research we will propose a hybrid approach that aims to improve the accuracy of the Bayes Network classifier by combining it with feature selection methods.

10. REFERENCES
[1] "Breast cancer statistics", World Cancer Research Fund, 22 August 2018. Available on: https://www.wcrf.org/dietandcancer/cancer-trends/breast-cancer-statistics.

[2] K. Ganesan, U. R. Acharya, C. K. Chua, L. C. Min, K. T. Abraham, and K.-H. Ng, "Computer-Aided Breast Cancer Detection Using Mammograms: A Review", IEEE Rev. Biomed. Eng., vol. 6, p. 77-98, 2013.

[3] H. Asri, H. Mousannif, H. A. Moatassime, and T. Noel, "Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis", Procedia Comput. Sci., vol. 83, p. 1064-1069, 2016.

[4] IEEE Staff and IEEE Staff, 2009 IEEE International Advance Computing Conference. 2009.

[5] K. Rajesh and S. Anand, "Analysis of SEER dataset for breast cancer diagnosis using C4.5 classification algorithm", Int. J. Adv. Res. Comput. Commun. Eng., vol. 1, no. 2, p. 2278-1021, 2012.

[6] D. Bazazeh and R. Shubair, "Comparative study of machine learning algorithms for breast cancer detection and diagnosis", in Electronic Devices, Systems and Applications (ICEDSA), 2016 5th International Conference on, 2016, p. 1-4.

[7] C. Shah and A. G. Jivani, "Comparison of data mining classification algorithms for breast cancer prediction", in Computing, Communications and Networking Technologies (ICCCNT), 2013 Fourth International Conference on, 2013, p. 1-4.

[8] Z. Nematzadeh, R. Ibrahim, and A. Selamat, "Comparative studies on breast cancer classifications with k-fold cross validations using machine learning techniques", in Control Conference (ASCC), 2015 10th Asian, 2015, p. 1-6.

[9] S. Kharya, D. Dubey, and S. Soni, "Predictive Machine Learning Techniques for Breast Cancer Detection", IJCSIT Int. J. Comput. Sci. Inf. Technol., vol. 4, no. 6, p. 1023-1028, 2013.

[10] J. Han and M. Kamber, Data mining: concepts and techniques, 2nd ed., reprint. Amsterdam: Elsevier/Morgan Kaufmann, 2010.

[11] A. Mahmood, "Structure Learning of Causal Bayesian Networks: A Survey", p. 6.

[12] J. Han and M. Kamber, Data mining: concepts and techniques, 3rd ed. Burlington, MA: Elsevier, 2011.

[13] H. Yusuff, N. Mohamad, U. Ngah, and A. Yahaya, "Breast cancer analysis using logistic regression", Int. J. Res. Appl. Stud., vol. 11, 2012.

[14] M. Negnevitsky, Artificial intelligence: a guide to intelligent systems, 2nd ed. Harlow, England; New York: Addison-Wesley, 2005.

[15] "UCI Machine Learning Repository: Breast Cancer Wisconsin (Original) Data Set". Available on: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original).

[16] "Machine Learning Project at the University of Waikato in New Zealand". Available on: https://www.cs.waikato.ac.nz/ml/index.html.

[17] H. Saoud, A. Ghadi, and M. Ghailani, "Analysis of Evolutionary Trends of Incidence and Mortality by Cancers", in M. Ben Ahmed and A. Boudhir (eds), Innovations in Smart Cities and Applications. SCAMS 2017. Lecture Notes in Networks and Systems, vol. 37. Springer, Cham, 2018.