Assignment Dataset
Application of Data Mining Classification Algorithms for Breast Cancer Diagnosis
Hajar Saoud 1
1 LIST Laboratory, UAE
Tangier, Morocco
Mohamed Ghailani 2
2 LabTIC Laboratory, UAE
Tangier, Morocco
Abderrahim Ghadi 1
1 LIST Laboratory, UAE
Tangier, Morocco
ABSTRACT
Breast cancer is one of the diseases with the highest incidence and
mortality in the world. Data mining classification techniques can
be effective tools for classifying cancer data and supporting
decision-making.
The objective of this paper is to compare the performance of
different machine learning algorithms in the diagnosis of breast
cancer, that is, in determining whether a tumor is benign or
malignant.
Six machine learning algorithms were evaluated in this research:
Bayes Network (BN), Support Vector Machine (SVM), k-nearest
neighbors algorithm (Knn), Artificial Neural Network (ANN),
Decision Tree (C4.5) and Logistic Regression. The algorithms
were run with the WEKA tool (the Waikato Environment for
Knowledge Analysis) on the Wisconsin breast cancer dataset
available in the UCI machine learning repository.
Keywords
Breast cancer; diagnosis; machine learning algorithms;
classification; WEKA
1. INTRODUCTION
Breast cancer is a disease in which cancer cells form in the tissue
of the breast and can spread to other organs of the body. It is the
second leading cause of cancer death among women, after lung
cancer [1]. Early diagnosis can reduce the breast cancer mortality
rate by 40% [2].
Data mining and machine learning algorithms have become
valuable tools in the field of health because they can process and
analyze massive data to extract information that is useful for
decision-making. They are therefore promising solutions for
predicting and diagnosing breast cancer, and for classifying a
tumor into one of its two categories, benign or malignant.
In this paper we examine the performance and accuracy of six
machine learning algorithms in the classification and diagnosis of
breast cancer: Bayes Network (BN), Support Vector Machine
(SVM), k-nearest neighbors algorithm (Knn), Artificial Neural
Network (ANN), Decision Tree (C4.5) and Logistic Regression,
applied to the Wisconsin breast cancer dataset using the WEKA
environment.
The rest of this paper is structured as follows. Part two presents
breast cancer. Part three gives an overview of similar research.
Parts four and five give a theoretical presentation of data mining
and machine learning algorithms. Part six describes the database
used. Part seven shows the experiments performed with the
WEKA software on the Wisconsin breast cancer dataset. The
results of these experiments are presented in part eight, and
finally the conclusion and perspectives are given in part nine.
2. BREAST CANCER
Breast cancer is an abnormal production of cells in the breast that
grow in an uncontrolled way. The masses of cells formed in the
breast are called tumors. The cancer cells can stay in the breast or
spread to other organs of the body. This spread is called
metastasis.
2.1 Types of breast cancer
Breast tumors are divided into two types, benign and malignant:
Benign tumors are not dangerous and have well-defined contours. They develop slowly in the organ where they appeared, without producing metastases. Benign tumors are composed of cells that resemble the normal cells of the breast tissue.
Malignant tumors are dangerous because they can spread to other organs of the body and produce metastases. Compared with normal cells, the cancer cells of malignant tumors show several abnormalities in shape, size and contours, and they lose their original characteristics.
2.2 Causes of breast cancer
The first risk factor for breast cancer is age: the risk of breast
cancer increases with age. Other factors can also intervene:
2.2.1 Family or genetic factors
Gender: women are the most affected by breast cancer.
Personal history: a woman who has already had cancer in one breast has an increased risk of developing cancer in the other breast.
SCA '18, October 10–11, 2018, Tetouan, Morocco
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6562-8/18/10…$15.00
https://doi.org/10.1145/3286606.3286861
Boudhir Anouar Abdelhakim 1
1 LIST Laboratory, UAE
Tangier, Morocco
SCA’2018, October 10-11, 2018, Tetouan, Morocco H.SAOUD et al.
Family history: if several relatives of the woman have been diagnosed with breast cancer, especially at a young age, her risk of developing breast cancer increases.
Genetic factors: some genetic mutations increase the risk of breast cancer.
2.2.2 Characteristics of the individual
Obesity: obesity increases the risk of breast cancer.
Early first period: having the first period before the age of 12 increases the risk of breast cancer.
Late menopause: a woman who starts menopause at a later age is more likely to develop breast cancer.
Having the first child at a later age: women who give birth to their first child after the age of 30 may have an increased risk of breast cancer.
Never having been pregnant: not having had a child increases the risk of developing breast cancer.
Hormone replacement therapy: treatment with estrogen and progesterone increases the risk of breast cancer after 5 years of treatment.
Drinking alcohol: drinking alcohol increases the risk of breast cancer.
3. RELATED WORKS
Several studies have been carried out in this field. They used data
mining and machine learning algorithms to classify and diagnose
different types of cancer.
Hiba Asri et al. [3] compared the performance of four classifiers
for predicting breast cancer: Support Vector Machine (SVM),
k-nearest neighbors algorithm (Knn), Decision Tree (C4.5) and
Naive Bayes (NB), on the Wisconsin breast cancer dataset with the
WEKA platform. The results showed that the classifier with the
highest accuracy is SVM, with an accuracy of 97.13%.
Akinsola Adeniyi F et al. [4] evaluated three classification
algorithms, C4.5, Multi-layer Perceptron and Naive Bayes, using
the WEKA environment. The results showed that the best
algorithm for breast cancer classification is C4.5, with the highest
accuracy of 93.9854%.
K. Rajesh and S. Anand [5] classified the SEER breast cancer
database into two categories, "Carcinoma in situ" and "Malignant
potential", using C4.5. They obtained an accuracy of 94% in the
training phase and 93% in the test phase.
Dana Bazazeh and Raed Shubair [6] studied and compared three
machine learning algorithms, Random Forest (RF), Support Vector
Machine (SVM) and Bayesian Networks (BN), on the Wisconsin
breast cancer dataset. The results showed that SVM has the best
performance in accuracy, specificity and precision, while RF has
the highest probability of classifying tumors correctly.
Chintan Shah and Anjali G. Jivani [7] examined three machine
learning algorithms, Decision Tree, Bayesian Network and
K-Nearest Neighbor, to predict breast cancer. Their comparison
criteria were the lowest computation time and the highest
accuracy. The algorithm that produced the best result was Naïve
Bayes, with an accuracy of 95.9943% and a computation time of
0.02.
Zahra Nematzadeh et al. [8] compared four machine learning
algorithms, Decision Tree, Naive Bayes, Neural Network and
Support Vector Machine, on the original and prognostic
Wisconsin breast cancer databases. They studied the impact of k
in k-fold cross validation on obtaining the best accuracy.
S. Kharya et al. [9] studied and compared theoretically four
techniques, among them Support Vector Machine, Bayesian belief
network and Artificial Neural Network, for the prediction of
breast cancer. They concluded that the Bayesian belief network is
the best technique for predicting breast cancer.
4. DATA MINING
Data mining [10] is the extraction of knowledge and information
from massive data coming from different sources. It can also be
defined as one step of the Knowledge Discovery from Data (KDD)
process, which can be presented in the following way:
Figure 1: Process of Knowledge Discovery from Data (KDD)
[10].
The steps are:
Data cleaning: this task aims to remove noise from the data.
Data integration: this task combines data from different sources.
Data selection: this task extracts the data relevant to the analysis task.
Data transformation: this phase applies summary and aggregation operations to transform the data into forms appropriate for mining.
Data mining: this is the essential phase, which extracts knowledge by applying data mining techniques.
Pattern evaluation: this phase identifies the interesting patterns that represent knowledge, based on certain measures of interest.
5. MACHINE LEARNING ALGORITHMS
5.1 Bayes Network
Bayes Networks [11], also called Bayesian belief networks, are
methods that are widely used for modeling and representing
knowledge in uncertain domains. A Bayes Network is a directed
acyclic graph (DAG) consisting of nodes that represent variables
and arcs that represent the probabilistic dependencies between
those variables. Figure 2 shows an example of a simple Bayes
Network.
Figure 2: Simple structure of a Bayes Network [11].
Each variable is conditionally independent of its non-descendants
given its parent nodes. A Bayes Network therefore calculates the
joint probability of a set of variables {Y1, Y2, ..., Yn} by
decomposing it into a product of conditional probability
distributions, one for each variable given its parents in the graph.
The joint probability of all the nodes can be written in the
following way:
$$P(Y_1, Y_2, \ldots, Y_n) = \prod_{i=1}^{n} P(Y_i \mid \mathrm{Parents}(Y_i))$$
where $\mathrm{Parents}(Y_i)$ represents the set of parent variables of $Y_i$.
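As a minimal sketch of this factorization, the following Python snippet evaluates the joint probability of a hypothetical two-node network (Malignant -> LargeCells). The variable names and all probability values are illustrative assumptions; they are not estimated from the Wisconsin dataset and do not reproduce the WEKA implementation.

```python
# Hypothetical two-node Bayes Network: Malignant -> LargeCells.
# All probabilities below are illustrative assumptions.

# Parents of each variable in the directed acyclic graph.
parents = {"Malignant": [], "LargeCells": ["Malignant"]}

# Conditional probability tables: cpt[variable][parent values][value].
cpt = {
    "Malignant": {(): {True: 0.35, False: 0.65}},
    "LargeCells": {
        (True,): {True: 0.9, False: 0.1},
        (False,): {True: 0.2, False: 0.8},
    },
}

def joint_probability(assignment):
    """Product over all nodes of P(node | parents(node))."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpt[var][parent_values][value]
    return prob

# P(Malignant = True, LargeCells = True) = 0.35 * 0.9
p = joint_probability({"Malignant": True, "LargeCells": True})
print(round(p, 3))
```

Summing the joint probability over the four possible assignments gives 1, which is a quick sanity check that the tables define a valid distribution.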
5.2 Support Vector Machines (SVM)
Support Vector Machines [10] are supervised learning models that
can be used for prediction and for the classification of both linear
and nonlinear data. The principle of the SVM algorithm is to use a
non-linear mapping to transform the original training data into a
higher dimension. In this new dimension, it searches for the linear
hyperplane of optimal separation. The SVM algorithm aims to
find the hyperplane with the largest margin, named the maximum
marginal hyperplane (MMH), in order to be more accurate in the
classification of future data. The margin is the shortest distance
between the MMH and the closest training tuple of either class.
Figure 3: The two possible separation hyperplanes and their
associated margins [10].
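To make the notion of margin concrete, the sketch below compares two candidate separating hyperplanes on a few hypothetical 2-D training tuples. The points and the hyperplane coefficients are assumptions chosen by hand for illustration; no training is performed, and this is not the optimization procedure used by WEKA.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))

def margin(w, b, points):
    """Shortest distance from the hyperplane to any training tuple."""
    return min(distance_to_hyperplane(w, b, x) for x in points)

# Hypothetical training tuples: the first two belong to one class,
# the last two to the other. Both hyperplanes separate them.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
w_a, b_a = (1.0, 1.0), -5.0   # hyperplane x + y - 5 = 0
w_b, b_b = (0.0, 1.0), -2.0   # hyperplane y - 2 = 0

margin_a = margin(w_a, b_a, points)
margin_b = margin(w_b, b_b, points)

# The hyperplane with the larger margin is the better candidate.
print(margin_a > margin_b)
```

Under these assumptions the first hyperplane keeps the closest tuple farther away than the second, so it would be preferred as the separating hyperplane.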
5.3 k-nearest neighbors algorithm (Knn)
The k-nearest neighbors classifiers [10] are based on learning by
analogy, i.e. they compare a given test tuple with training tuples
that are similar to it. They classify the tuple using more than one
nearest neighbor. The principle of the k-nearest neighbors
classifier is that it searches the pattern space for the k training
tuples closest to the unknown tuple. These tuples are named the
k nearest neighbors of the unknown tuple.
To find the k tuples closest to the test tuple, it is necessary to
calculate and compare distances. Several distance measures exist;
the Euclidean distance is one of them. The Euclidean distance
between two points or tuples, X1 = (x11, x12, ..., x1n) and
X2 = (x21, x22, ..., x2n), is:
$$\mathrm{dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$
The most appropriate value of k is determined experimentally:
starting with k = 1, the classification error rate is calculated for
each value of k, and the value that gives the minimal error is
retained.
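The procedure can be sketched in a few lines of Python. The toy training tuples are hypothetical, and the leave-one-out estimate of the error rate is an assumption made for this illustration; the text does not state how the error rate was estimated.

```python
import math
from collections import Counter

def euclidean(x1, x2):
    """Euclidean distance between two tuples of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_predict(train, query, k):
    """Majority vote among the k training tuples closest to query."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def loo_error(train, k):
    """Leave-one-out classification error rate for a given k."""
    errors = 0
    for i, (features, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        if knn_predict(rest, features, k) != label:
            errors += 1
    return errors / len(train)

# Hypothetical two-attribute training tuples.
train = [((1, 1), "benign"), ((2, 1), "benign"), ((1, 2), "benign"),
         ((8, 9), "malignant"), ((9, 8), "malignant"), ((10, 10), "malignant")]

prediction = knn_predict(train, (7, 8), k=3)
# Try odd values of k and keep the one with the smallest error rate.
best_k = min([1, 3, 5], key=lambda k: loo_error(train, k))
print(prediction, best_k)
```

Odd values of k are used here so that a two-class vote cannot end in a tie.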
5.4 Decision Tree (C4.5)
The decision tree algorithm [12] is a classification algorithm
whose model resembles a flowchart: the internal (non-leaf) nodes
of a decision tree represent tests on the attributes, the branches
represent the outcomes of the tests and the external nodes (leaves)
represent the predicted classes. At each node, the algorithm
chooses the best attribute to partition the data into individual
classes. The tree is built by recursively partitioning the provided
data into subsets in this way.
Figure 4: Example of a decision tree [12].
Figure 4 shows an example of a decision tree that predicts whether
a customer is likely to buy a computer. Each internal (non-leaf)
node represents a test on an attribute. Each leaf node represents a
class (either buys a computer = yes, or buys a computer = no).
There are two ways to build a decision tree: from top to bottom
and from bottom to top. There are several top-down decision tree
algorithms, for example CART, C4.5 and ID3.
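One common way to choose the "best" attribute at a node is to measure how much a split reduces the class entropy. The sketch below computes this information gain on a hypothetical four-row version of the computer-buying example; the attribute names and rows are assumptions. Note that C4.5 itself ranks attributes by gain ratio, a normalized variant of this measure, which is not implemented here.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy, in bits, of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting the rows on attr."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical customers and whether they buy a computer.
rows = [
    {"student": "yes", "credit": "fair"},
    {"student": "yes", "credit": "excellent"},
    {"student": "no", "credit": "fair"},
    {"student": "no", "credit": "excellent"},
]
labels = ["yes", "yes", "no", "no"]

# The attribute with the highest gain is tested at the node.
best_attr = max(["student", "credit"],
                key=lambda a: information_gain(rows, labels, a))
print(best_attr)
```

In this toy data the "student" attribute separates the two classes perfectly, so it would be tested at the root of the tree.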
5.5 Logistic regression
Logistic regression [13] is one of the generalized linear models
most widely used in machine learning. From a set of predictor
variables, logistic regression predicts the probability of an
outcome that can take two values. It is mainly used for prediction
and for estimating the probability of success.
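As a brief illustration, a fitted logistic regression model turns a weighted sum of the predictor variables into a probability with the logistic (sigmoid) function. The weights, bias and feature values below are hypothetical and are not the coefficients obtained in the experiments.

```python
import math

def predict_probability(weights, bias, features):
    """Probability of the positive outcome under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two predictor variables.
weights = [0.5, 0.8]
bias = -6.0

p_low = predict_probability(weights, bias, [1, 1])    # z = -4.7
p_high = predict_probability(weights, bias, [9, 8])   # z = 4.9

# A threshold of 0.5 turns the probability into a two-class decision.
print(p_low < 0.5, p_high > 0.5)
```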
5.6 Artificial Neural Network (ANN)
A Neural Network [14] can be defined as a reasoning model based
on the human brain. An Artificial Neural Network is a set of very
simple processors (or neurons), highly interconnected by weighted
connections that pass signals from one neuron to another, and
operating in parallel. These neurons are similar to the biological
neurons of the human brain. An Artificial Neural Network
consists of an input layer, an output layer and, between them, one
or more additional layers called hidden layers.
Figure 5: Architecture of an Artificial Neural Network [14].
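The way signals pass through the weighted connections can be sketched as a forward pass through one hidden layer. The layer sizes, the weights and the choice of a sigmoid activation are all assumptions made for this illustration; they do not describe the network configured in WEKA.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_output(inputs, weights, biases):
    """Each neuron outputs sigmoid(weighted sum of its inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer_output(inputs, hidden_weights, hidden_biases)
    return layer_output(hidden, output_weights, output_biases)

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_weights = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.5, -1.0]]
output_biases = [-0.2]

output = forward([0.5, 0.1, 0.9], hidden_weights, hidden_biases,
                 output_weights, output_biases)
print(output)
```

Training such a network consists of adjusting the weights and biases so that the output matches the known class of each training tuple; that step is not shown here.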
6. DESCRIPTION OF THE DATASET
The database that we used in this research is the Wisconsin breast
cancer dataset available in the UCI machine learning repository
[15]. It contains 699 records (458 benign tumors and 241
malignant tumors). It is composed of 11 variables: a sample
identifier, nine predictor attributes and one class variable that
indicates whether the tumor is benign or malignant. Each
predictor attribute takes an integer value between 1 and 10, where
1 corresponds to the state closest to normal and 10 corresponds to
the most abnormal state.
The table below presents the description of the 11 attributes of the
Wisconsin breast cancer dataset:
Attributes                    Description
Id                            A code that identifies each record.
Clump Thickness               Benign cells are grouped in monolayers, whereas cancer cells are grouped in multilayers.
Uniformity of Cell Size       The size of the cancer cells.
Uniformity of Cell Shape      The shape of the cancer cells.
Marginal Adhesion             Cancer cells can lose their ability to adhere to each other; this is a sign of malignancy.
Single Epithelial Cell Size   The size of single epithelial cells.
Bare Nuclei                   Nuclei that are not surrounded by the rest of the cell, as seen in benign tumors.
Bland Chromatin               Cancer cells have coarse chromatin.
Normal Nucleoli               In cancer cells the nucleoli become prominent, whereas in normal cells the nucleoli are small.
Mitoses                       Cell growth.
Class                         Whether the tumor is benign or malignant.
Table 1: The 11 attributes of the breast cancer database.
7. EXPERIMENTATION
The platform that we used to apply the machine learning
algorithms to the breast cancer database is WEKA [16]. WEKA is
a collection of open source machine learning algorithms that
makes it possible to carry out data mining tasks on real-world
problems. It contains tools for data preprocessing, classification,
regression, clustering and association rules. It also offers an
environment for developing new models.
The figures below show the performance of the machine learning
algorithms in the WEKA environment:
7.1 Bayes Network
Figure 6: Results of running Bayes Network (BN) on the
database.
7.2 Support Vector Machines (SVM)
Figure 7: Results of running Support Vector Machine (SVM)
on the database.
7.3 k-nearest neighbors algorithm (Knn)
Figure 8: Results of the execution of k-nearest neighbors
algorithm (Knn) on the database.
7.4 Decision Tree (C4.5)
Figure 9: Results of the execution of Decision Tree (C4.5) on
the database.
7.5 Logistic regression
Figure 10: Results of running Logistic regression on the
database.
7.6 Artificial Neural Network (ANN)
Figure 11: Results of running the Artificial Neural Network
(ANN) on the database.
8. RESULTS AND EVALUATION
8.1 K-fold Cross-validation
To evaluate the performance of the machine learning algorithms
on the breast cancer data we used the K-fold cross validation test
method. This method divides the database into K subsets (folds)
of approximately equal size. In each of the K iterations, one fold
is held out as test data to evaluate the performance of the model,
while the remaining K-1 folds are used as training data to build
the model. This is the most widely used method for evaluating
machine learning techniques.
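The splitting step can be sketched as follows. The value K = 10 is an assumption made for this example (the text does not state the value used), and the split is a plain sequential one; practical tools typically also shuffle or stratify the records.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra record.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 699 records, as in the Wisconsin dataset; K = 10 is assumed.
splits = list(k_fold_splits(699, 10))
for train, test in splits:
    # Every record is used for testing exactly once across the folds.
    assert len(train) + len(test) == 699
print(len(splits), len(splits[0][1]), len(splits[-1][1]))
```

The reported performance of a classifier is then the average of the scores obtained on the K held-out folds.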
8.2 Evaluation
We evaluate the performance of the compared methods in terms
of accuracy, time taken, correctly classified instances and
incorrectly classified instances:
Methods     Accuracy    Correctly classified   Incorrectly classified   Time taken
BN          97.2818 %   680                    19                       0.04
SVM         97.2818 %   680                    19                       0.08
LR          96.5665 %   675                    24                       0.04
ANN         95.422 %    667                    32                       1.42
KNN         95.279 %    666                    33                       0
DT (C4.5)   95.1359 %   665                    34                       0.03
Table 2: Evaluation of the performance of the methods
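The accuracy column of Table 2 follows directly from the two instance counts. The short check below recomputes it from the values reported in the table; only the function name is introduced here for illustration.

```python
# (correctly classified, incorrectly classified) instances from Table 2.
results = {
    "BN": (680, 19),
    "SVM": (680, 19),
    "LR": (675, 24),
    "ANN": (667, 32),
    "KNN": (666, 33),
    "DT (C4.5)": (665, 34),
}

def accuracy_percent(correct, wrong):
    """Share of correctly classified instances, as a percentage."""
    return 100.0 * correct / (correct + wrong)

accuracies = {name: round(accuracy_percent(c, w), 4)
              for name, (c, w) in results.items()}
for name, value in accuracies.items():
    print(name, value)
```

Each pair of counts sums to the 699 records of the dataset, and the recomputed percentages agree with the accuracy column of the table.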
Methods     Correctly classified   Correctly classified   Precision    Precision       Average
            benign instances       malignant instances    for benign   for malignant
BN          442                    238                    99.3%        93.7%           97.4%
SVM         443                    237                    99.1%        94%             97.4%
LR          446                    229                    97.4%        95%             96.6%
ANN         442                    225                    96.5%        93.4%           95.4%
KNN         445                    221                    95.7%        94.4%           95.3%
DT (C4.5)   438                    227                    96.9%        91.9%           95.2%
Table 3: Evaluation of the performance of the methods for
each type of cancer
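The precision values of Table 3 can be recovered from the per-class counts together with the class totals of the dataset (458 benign and 241 malignant records). The sketch below does this for the Bayes Network row. Reading the "Average" column as a class-weighted average is an assumption; it is consistent with the reported value for this row.

```python
TOTAL_BENIGN, TOTAL_MALIGNANT = 458, 241

def class_precisions(correct_benign, correct_malignant):
    """Precision for each class from the correctly classified counts."""
    # Records of one class that were assigned to the other class.
    benign_as_malignant = TOTAL_BENIGN - correct_benign
    malignant_as_benign = TOTAL_MALIGNANT - correct_malignant
    precision_benign = correct_benign / (correct_benign + malignant_as_benign)
    precision_malignant = correct_malignant / (
        correct_malignant + benign_as_malignant)
    return precision_benign, precision_malignant

# Bayes Network row of Table 3.
bn_benign, bn_malignant = class_precisions(442, 238)

# Average weighted by the number of records in each class (assumed).
bn_average = (bn_benign * TOTAL_BENIGN + bn_malignant * TOTAL_MALIGNANT) / (
    TOTAL_BENIGN + TOTAL_MALIGNANT)

print(round(100 * bn_benign, 1),
      round(100 * bn_malignant, 1),
      round(100 * bn_average, 1))
```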
We also evaluate the following measures:
Kappa Statistic: measures the agreement between the predicted and the actual classifications, corrected for the agreement expected by chance.
Mean Absolute Error: the average of the absolute differences between prediction and actual observation.
Root Mean Squared Error: the square root of the average of the squared differences between prediction and actual observation.
Root Relative Squared Error.
Relative Absolute Error.
Methods     Kappa       Mean Absolute   Root Mean       Root Relative   Relative
            Statistic   Error           Squared Error   Squared Error   Absolute Error
BN          0.9406      0.0289          0.1623          34.1484 %       6.3954 %
SVM         0.9405      0.0272          0.1649          34.6872 %       6.0141 %
LR          0.924       0.0486          0.1667          35.0808 %       10.7588 %
ANN         0.8987      0.0497          0.2031          42.7227 %       11.0061 %
KNN         0.8948      0.0473          0.2128          44.7723 %       10.4642 %
DT (C4.5)   0.893       0.0637          0.2142          45.0739 %       14.1031 %
Table 4: Evaluation of the errors of the methods
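As an illustration of the Kappa Statistic, the sketch below computes Cohen's kappa for the Bayes Network from the counts implied by Table 3 and the class totals (442 of 458 benign and 238 of 241 malignant records correctly classified). The function and parameter names are introduced here for illustration only.

```python
def cohen_kappa(benign_ok, benign_as_malignant, malignant_as_benign, malignant_ok):
    """Agreement between actual and predicted classes, corrected for chance."""
    total = benign_ok + benign_as_malignant + malignant_as_benign + malignant_ok
    observed = (benign_ok + malignant_ok) / total

    actual_benign = benign_ok + benign_as_malignant
    actual_malignant = malignant_as_benign + malignant_ok
    predicted_benign = benign_ok + malignant_as_benign
    predicted_malignant = benign_as_malignant + malignant_ok

    # Agreement expected if predictions were independent of the actual class.
    expected = (actual_benign * predicted_benign
                + actual_malignant * predicted_malignant) / total ** 2
    return (observed - expected) / (1 - expected)

# Bayes Network: 16 benign and 3 malignant records are misclassified.
kappa_bn = cohen_kappa(442, 16, 3, 238)
print(round(kappa_bn, 4))
```

The rounded result agrees with the Kappa Statistic reported for BN in Table 4.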
8.3 Confusion matrix
The confusion matrix contains information about the actual and
the predicted classifications:
                   Predicted benign       Predicted malignant
Actual benign      TP (True Positives)    FN (False Negatives)
Actual malignant   FP (False Positives)   TN (True Negatives)
Table 5: Confusion matrix
TP: the cases predicted as benign tumors that are in fact benign
tumors.
TN: the cases predicted as malignant tumors that are in fact
malignant tumors.
FP: the cases predicted as benign tumors that are in reality
malignant tumors.
FN: the cases predicted as malignant tumors that are in reality
benign tumors.
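From the confusion matrix we can calculate:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$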
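Using the standard definitions of these three measures, the sketch below evaluates them for the Bayes Network, with benign as the positive class as in Table 5. The four counts are derived from Table 3 and the class totals of the dataset; the function name is hypothetical.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # share of benign cases recognized
    specificity = tn / (tn + fp)   # share of malignant cases recognized
    return accuracy, sensitivity, specificity

# Bayes Network: 442 of 458 benign and 238 of 241 malignant records
# are correctly classified.
accuracy, sensitivity, specificity = confusion_metrics(tp=442, fn=16, fp=3, tn=238)
print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```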
The diagram in Figure 12 shows the accuracy, sensitivity and
specificity of the classifiers.
Figure 12: Comparison of the Accuracy, Sensitivity and
Specificity.
9. CONCLUSION
In this paper we examined six machine learning techniques for
the diagnosis of breast cancer. We used the Wisconsin breast
cancer dataset with the WEKA simulation environment. The
algorithms that gave the highest accuracy are Bayes Network and
Support Vector Machine (SVM), both with an accuracy of
97.2818%. Between the two, we consider the Bayes Network
classifier the better algorithm for the diagnosis and classification
of breast cancer, because it took less time than the Support Vector
Machine (0.04 versus 0.08 in Table 2). In future research we will
propose a hybrid approach that aims to improve the accuracy of
the Bayes Network classifier by combining it with feature
selection methods.
10. REFERENCES
[1] « Breast cancer statistics », World Cancer Research Fund, 22 August 2018. Available on:
https://www.wcrf.org/dietandcancer/cancer-trends/breast-cancer-statistics.
[2] K. Ganesan, U. R. Acharya, C. K. Chua, L. C. Min, K. T. Abraham, and K.-H. Ng, « Computer-Aided Breast Cancer
Detection Using Mammograms: A Review », IEEE Rev.
Biomed. Eng., vol. 6, p. 77‑98, 2013.
[3] H. Asri, H. Mousannif, H. A. Moatassime, and T. Noel, « Using Machine Learning Algorithms for Breast Cancer
Risk Prediction and Diagnosis », Procedia Comput. Sci., vol.
83, p. 1064-1069, 2016.
[4] IEEE Staff and IEEE Staff, 2009 IEEE International Advance Computing Conference. 2009.
[5] K. Rajesh and S. Anand, « Analysis of SEER dataset for breast cancer diagnosis using C4.5 classification
algorithm », Int. J. Adv. Res. Comput. Commun. Eng., vol.
1, no. 2, p. 2278–1021, 2012.
[6] D. Bazazeh and R. Shubair, « Comparative study of machine learning algorithms for breast cancer detection and
diagnosis », in Electronic Devices, Systems and Applications
(ICEDSA), 2016 5th International Conference on, 2016, p.
1–4.
[7] C. Shah and A. G. Jivani, « Comparison of data mining classification algorithms for breast cancer prediction », in
Computing, Communications and Networking Technologies
(ICCCNT), 2013 Fourth International Conference on, 2013,
p. 1–4.
[8] Z. Nematzadeh, R. Ibrahim, and A. Selamat, « Comparative studies on breast cancer classifications with k-fold cross
validations using machine learning techniques », in Control
Conference (ASCC), 2015 10th Asian, 2015, p. 1–6.
[9] S. Kharya, D. Dubey, and S. Soni, « Predictive Machine Learning Techniques for Breast Cancer Detection », IJCSIT
Int. J. Comput. Sci. Inf. Technol., vol. 4, no 6, p. 1023–1028,
2013.
[10] J. Han and M. Kamber, Data mining: concepts and techniques, 2. ed., [Nachdr.]. Amsterdam: Elsevier/Morgan
Kaufmann, 2010.
[11] A. Mahmood, « Structure Learning of Causal Bayesian Networks: A Survey », p. 6.
[12] J. Han and M. Kamber, Data mining: concepts and techniques, 3rd ed. Burlington, MA: Elsevier, 2011.
[13] H. Yusuff, N. Mohamad, U. Ngah, and A. Yahaya, « Breast cancer analysis using logistic regression », Int. J. Res. Appl.
Stud., vol. 11, 2012.
[14] M. Negnevitsky, Artificial intelligence: a guide to intelligent systems, 2nd ed. Harlow, England; New York: Addison-Wesley, 2005.
[15] « UCI Machine Learning Repository: Breast Cancer Wisconsin (Original) Data Set ». Available on:
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original).
[16] « Machine Learning Project at the University of Waikato in New Zealand ». Available on:
https://www.cs.waikato.ac.nz/ml/index.html.
[17] Saoud H., Ghadi A., Ghailani M. (2018) Analysis of Evolutionary Trends of Incidence and Mortality by Cancers.
In: Ben Ahmed M., Boudhir A. (eds) Innovations in Smart
Cities and Applications. SCAMS 2017. Lecture Notes in
Networks and Systems, vol 37. Springer, Cham