2020 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE) | 978-1-7281-9111-9/20/$31.00 ©2021 IEEE | DOI: 10.1109/ICCCEEE49695.2021.9429620

Comparison between Neural Networks and Binary Logistic Regression for Classification Observation (Case Study: Risk Factors for Cardiovascular Disease)

Eyas Gaffar Abdelraheem Osman (2), Associate Professor, Shaqra University, Saudi Arabia, eyas-gaffar@su.edu.sa

Ebtehag Mustafa Mohammed (1), PhD, Faculty of Economic and Rural Development, University of Gezira, Wad Medani, Sudan, ebtehagmustafa22@gmail.com

Abstract— This study compares the artificial neural network method with the logistic regression method as approaches for classifying observations with a binary (dual-data) response. The two methods were compared using the proportion of misclassified observations, model accuracy, and the area under the ROC curve. This hospital-based case-control study involved 750 cardiovascular disease cases and 50 controls, all recruited from Madani Heart Centre in Sudan in 2019. The study aimed to identify the most important risk factors for cardiovascular disease, to compare the binary logistic model with the neural network model, and to determine which of the two approaches is better suited to processing such data. The data were analysed with SPSS version 25. The main results are that the two methods are similar regarding the significance and importance of the independent variables considered in the analysis, but the artificial neural network achieved a better classification rate than the binary logistic regression model. The main recommendation is to make wider use of statistical methods and to extend the application of both neural networks and the logistic model to all fields of knowledge.

Keywords— Artificial Neural Network, logistic regression, cardiovascular disease, dual-data response.

I. INTRODUCTION

Cardiovascular diseases (CVDs) are one of the leading causes of death worldwide, and according to the World Health Organization more than 23 million people are expected to die from CVD by 2030 [1]. CVDs have a serious socio-economic effect on individuals, families, and societies in terms of healthcare costs, work absenteeism, and national productivity. Four risk factors (tobacco use, excessive alcohol consumption, unhealthy diet, and physical inactivity) are associated with four disease groups (cardiovascular disease, cancer, chronic pulmonary disease, and diabetes) [2]. Six of the top ten leading causes of death in 2012 were non-communicable diseases (NCDs), including the top three (ischemic heart disease, stroke, and chronic obstructive pulmonary disease) [3].

Machine learning algorithms:

A. Artificial Neural Network: ANNs are computing systems loosely inspired by the biological neural networks that constitute animal brains [4]. A neural network is made up of neurons, which are simple, interconnected processors. The neurons are connected to one another by weighted connections over which signals can pass. The modelling cycle comprises data collection, analysis and preprocessing, network structure design (number of hidden layers and units), weight initialization, network training, network simulation, weight and bias adjustment, and network testing. Artificial neural networks are used in many fields to handle large data sets, often providing useful analyses that allow new information to be predicted and recognized. They model the underlying data through a process of learning and training. The data used by these architectures often have nonlinear relationships between inputs and outputs. ANNs are used in applications such as speech recognition, imaging, control, estimation, and optimization, among many others. They are also applied in real-world settings in finance, medicine, business, mining, and so on [5]. The figure below shows the architecture of the ANN used in this study.

Authorized licensed use limited to: Walden University. Downloaded on June 25,2021 at 12:54:01 UTC from IEEE Xplore. Restrictions apply.

B. Logistic Regression: Logistic regression, sometimes called the logistic model or logit model, analyses the relationship between multiple independent variables and a categorical dependent variable. It is a method for fitting a regression curve, y = f(x), when y consists of binary-coded data (0 = failure, 1 = success). When the response is a binary (dichotomous) variable and x is numerical, logistic regression fits a logistic curve to the relationship between these variables. The logistic curve is an S-shaped or sigmoid curve, often used to model population growth [6]. It starts with slow growth, passes through a phase of rapid, roughly exponential growth, and then slows again as it approaches a plateau.
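The S-shape described above can be sketched in a few lines of Python. This is only an illustration: the function names and the coefficient values (`b0`, `b1`) are hypothetical and are not taken from the study.

```python
import math

def logistic(z: float) -> float:
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x: float, b0: float, b1: float) -> float:
    """P(y = 1 | x) for a one-predictor logistic model (intercept b0, slope b1)."""
    return logistic(b0 + b1 * x)

# Hypothetical coefficients, chosen only to show the S-shape.
for x in (-4, -2, 0, 2, 4):
    print(x, round(predict_proba(x, b0=0.0, b1=1.5), 3))
```

The printed probabilities rise slowly at first, fastest around x = 0, and then level off toward 1, which is the behaviour the logistic curve is used to model.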

C. Decision Tree: A decision (regression) tree is a tree-like structure that classifies instances by sorting them based on the values of their variables. Each node in a decision tree represents a variable of the instance to be classified, and each branch represents a value that the node can take. Instances are classified starting at the root node and sorted according to the values of their variables. The variable that best divides the dataset becomes the root node of the tree. Internal nodes (or split nodes) are the decision-making part of the tree: each applies a splitting rule, which may be chosen by one of several algorithms, and directs the instance to a subsequent node. Splitting terminates when a user-defined criterion is reached. The paths from the root node to the leaf nodes represent classification rules.
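As a rough illustration of how "the variable that best divides the dataset" can be chosen, the sketch below scores candidate thresholds on one numeric variable using Gini impurity. Gini impurity is one common splitting criterion and is an assumption here; the paper does not say which criterion is meant. The data and function names are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # splitting at the maximum leaves one side empty
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (hypothetical): cigarettes per day and disease status.
cigs = [0, 0, 2, 5, 10, 15, 20, 30]
status = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_threshold(cigs, status))  # (5, 0.0): splitting at <= 5 separates the classes
```

A full tree would repeat this search over every variable at every node and stop when the user-defined criterion is met.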

D. Random Forest: A random forest is an ensemble model consisting of multiple regression trees, like trees in a forest. It combines several regression trees, trains each one on a slightly different sample of the dataset, and splits the nodes in each tree considering only a limited number of the variables. The final predictions of the random forest are made by averaging the predictions of the individual trees, which improves prediction accuracy on unseen data.
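The three ingredients named above (a resampled training set per tree, a restricted set of variables per split, and averaging of the tree predictions) can be sketched as follows. The tree-fitting step itself is omitted, and all names and numbers are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the training set for one tree)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at a split."""
    return sorted(rng.sample(range(n_features), k))

def average_predictions(per_tree_predictions):
    """Average the predictions of all trees, observation by observation."""
    n_trees = len(per_tree_predictions)
    return [sum(col) / n_trees for col in zip(*per_tree_predictions)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for 10 observations
print(bootstrap_sample(rows, rng))
print(random_feature_subset(n_features=9, k=3, rng=rng))
print(average_predictions([[0.9, 0.2], [0.7, 0.4], [0.8, 0.0]]))
```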

E. k-Nearest Neighbour: k-Nearest Neighbour (kNN) is one of the simplest non-parametric algorithms; it makes no assumptions about the distribution of the underlying data. The algorithm relies on a distance measure, typically Euclidean distance, and on the principle that cases in a dataset generally lie close to other cases with similar properties. If the cases are labelled with a class, the label of an unclassified case can be determined by observing the class of its nearest neighbours [7].
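A minimal sketch of this idea, assuming Euclidean distance and a simple majority vote among the k nearest cases; the toy data and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Majority class among the k training points nearest to `query`."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (hypothetical): two numeric features per case, label 1 = disease.
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
y = [0, 0, 0, 1, 1, 1]
print(knn_predict(X, y, query=(1.1, 1.0)))  # nearest three cases are all class 0
print(knn_predict(X, y, query=(4.0, 4.0)))  # nearest three cases are all class 1
```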

II. MATERIALS AND METHODS

This study set out to: (1) determine the most important risk factors for cardiovascular disease; (2) compare artificial neural networks and binary logistic regression in discriminating between patients with the disease and those without it; and (3) identify which of the two methods is better for classifying and processing such data.

This hospital-based case-control study relied on observations of 800 subjects: 750 CVD cases and 50 controls, all recruited from Madani Heart Centre in Sudan in 2019. Subjects were selected randomly, and all were aged 15 years or above. All subjects were interviewed face-to-face to complete a questionnaire covering many CVD-related variables, including non-modifiable risk factors such as age and sex and modifiable risk factors such as state, residence, and obesity.

Data were collected with due consideration of ethical aspects; approval was obtained from the Medani Heart Centre and the University of Gezira.

After collection, the data were coded, organized, and analysed using SPSS version 25 for Windows. To achieve the objectives of the study, the data were analysed using the binary logistic model and a multilayer perceptron, a feed-forward neural network trained by supervised learning with the back-propagation (BP) algorithm. To compare the results of the binary logistic model and the network model, the proportion of misclassified observations was computed for each model separately.

Related Works:

- Aravind Akella and Vibhor Kaushik applied six different machine learning (ML) algorithms to predict the presence of cardiovascular disease among patients listed in an openly available dataset. All six ML algorithms achieved accuracies greater than 80%, with the neural network achieving an accuracy greater than 93%. The recall achieved with the neural network model was also the highest of the six models (0.93). In addition, five of the six algorithms produced very similar AUC-ROC curves. The AUC-ROC curve for the neural network is slightly steeper, suggesting a higher true positive rate for this model. The authors demonstrated that ML algorithms can be applied with high accuracy and recall to detect the presence of CAD using a publicly available dataset [7].

- Simon Nusinovici et al. (2020) aimed to evaluate the performance of ML algorithms and to compare them with logistic regression for predicting the risk of cardiovascular diseases (CVDs), chronic kidney disease (CKD), diabetes (DM), and hypertension (HTN) in a prospective cohort study using simple clinical predictors. The results showed that logistic regression, gradient boosting machine, and neural network were consistently ranked among the best models. The study also concluded that logistic regression performs as well as ML models in predicting the risk of major chronic diseases with low incidence and simple clinical predictors. The authors suggest that traditional regression models should continue to play a key role in disease risk prediction, and that further studies are needed to confirm this result for different settings and study characteristics [8].

- Aiguo Wang et al. (2014) demonstrated that integrating logistic regression and artificial neural networks provides an effective method for determining risk factors and predicting hypertension, as well as a general approach for predicting other chronic diseases [9].


- According to K. Uma Maheswari and J. Jasmine (2017), the advantage of logistic regression is the interpretability of model parameters and ease of use. The advantage of the neural network is that it requires less formal statistical training to develop and can implicitly detect complex nonlinear relationships between dependent and independent variables. Combining logistic regression and a neural network provides a novel approach to predicting a person's heart disease. Future work can be extended to longitudinal studies of patients and to improving the precision of heart disease prediction [10].

III. RESULTS

A. Results of the binary logistic model

After collection, the data were coded, organized, and analysed using the Statistical Package for the Social Sciences (SPSS). We used binary logistic regression to assess the multiplicative effect of pathological, socio-economic, and demographic explanatory variables on cardiovascular disease status as the dependent variable. Patients were separated into two categories: (1) patients who have heart disease, labelled 1; and (2) patients who do not have heart disease, labelled 0. The logistic regression was estimated in three steps: first, we evaluated the overall model; second, we determined the significance of each explanatory variable; third, we assessed predictive accuracy.

Table (1) below shows the Omnibus Test of Model Coefficients, used to check that the new model (with explanatory variables included) is an improvement over the baseline model (without explanatory variables). It uses chi-square tests to check whether there is a significant difference between the log-likelihoods of the baseline model and the new model. The statistics for Step, Block, and Model are the same because we did not use stepwise logistic regression or blocking. The table shows a chi-square value of 224.829 with 9 degrees of freedom and p = .000, which indicates that adding the explanatory variables significantly improves the model. The hypotheses of the overall chi-square test are $H_0: \beta_i = 0$ for all $i$ versus $H_1: \beta_i \neq 0$ for at least one coefficient, so $H_0$ is rejected since the p-value is 0.000.

TABLE 1. Omnibus Tests of Model Coefficients

                 Chi-square   df   Sig.
Step 1   Step    224.829      9    0.000
         Block   224.829      9    0.000
         Model   224.829      9    0.000

The model statistics in Table (2) below give the -2 log-likelihood (-2LL), Cox & Snell R Square, and Nagelkerke R Square values for the full model. The -2LL value for the model is 149.237, a large decrease from the baseline -2LL, which indicates that the new model (with explanatory variables) fits significantly better than the model with only a constant. Cox & Snell's R Square attempts to imitate R-squared based on the likelihood, but its maximum is usually less than 1. Here it indicates that 24.5% of the variation in the dependent variable is explained by the logistic model. Nagelkerke R Square rescales this measure so that it can reach 1 and is therefore a more reliable measure of the relationship. It indicates that 65.5% of the variation in the dependent variable is explained by the logistic model.

TABLE 2. Logistic Regression's Model Summary

Step   -2 Log-likelihood   Cox & Snell R Square   Nagelkerke R Square
1      149.237             .245                   .655
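As a cross-check, the two pseudo R-squared values can be recomputed from the reported -2LL, the omnibus chi-square, and the sample size using the standard Cox & Snell and Nagelkerke formulas. The sketch below assumes all 800 observations entered the model; the variable names are illustrative only.

```python
import math

n = 800                 # observations (750 cases + 50 controls), assumed all used
neg2ll_model = 149.237  # -2LL of the fitted model (Table 2)
chi_square = 224.829    # omnibus chi-square = -2LL(baseline) - (-2LL(model)) (Table 1)

neg2ll_baseline = neg2ll_model + chi_square

# Cox & Snell: 1 - (L_baseline / L_model)^(2/n), written in terms of -2LL.
cox_snell = 1.0 - math.exp(-chi_square / n)

# Nagelkerke: Cox & Snell divided by its maximum attainable value.
max_cox_snell = 1.0 - math.exp(-neg2ll_baseline / n)
nagelkerke = cox_snell / max_cox_snell

print(round(neg2ll_baseline, 3), round(cox_snell, 3), round(nagelkerke, 3))
```

Under these assumptions the results agree with the values in Table 2 to within rounding, which supports reading them as 24.5% and 65.5%.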

Table (3) below shows the Hosmer and Lemeshow (H-L) test of model fit, which suggests that the model fits the data well. The chi-square value is 6.871 with 8 degrees of freedom and p = 0.551, which is greater than 0.05, so the hypothesis that the model fits the data is not rejected.

TABLE 3. Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      6.871        8    .551

Table (4) below shows how many cases with an observed value of 1 or 0 for the dependent variable (heart disease status) were correctly predicted.

TABLE 4. Classification Table

                                            Predicted
Observed (pathological case)   Non-heart patient   Heart patient   Percentage correct
Non-heart patient              26                  24              56.0
Heart patient                  9                   741             98.5
Overall percentage                                                 95.9

In the table, the columns are the two predicted values of the dependent variable, while the rows are the two observed (actual) values. In a perfect model, all cases would lie on the diagonal and the overall percentage correct would be 100%.

According to the table, 95.9% of cases overall were correctly classified ((741 + 26)/800 = 0.959), while 33 cases were classified incorrectly.
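The arithmetic behind these figures can be reproduced directly from the four counts in the classification table. The sketch below is illustrative; the names `tn`, `fp`, `fn`, and `tp` are the usual confusion-matrix labels and do not appear in the paper.

```python
# Counts from Table 4 (rows = observed, columns = predicted).
tn, fp = 26, 24    # observed non-heart patients: predicted non-heart / heart
fn, tp = 9, 741    # observed heart patients: predicted non-heart / heart

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
misclassified = fp + fn
sensitivity = tp / (tp + fn)   # share of heart patients correctly classified
specificity = tn / (tn + fp)   # share of non-heart patients correctly classified

print(total, misclassified)
print(f"{accuracy:.1%} {sensitivity:.1%} {specificity:.1%}")
```

The overall accuracy and the 33 misclassified cases match the text. Computed from the counts, the per-class rates come to 98.8% and 52.0%, which differ slightly from the 98.5% and 56.0% printed in Table 4.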

Logistic regression parameterization:

TABLE 5. Variables in the Equation

Variables                                B       S.E.    Wald    df   Sig.   Exp(B)
Renal disease                            1.971   .818    5.806   1    .016   7.177
Using oil cooking                        8.06    .468    2.973   1    .085   .446
No. of meals eaten with oil per day      .021    .019    1.144   1    .285   1.021
Passive smoking                          .676    .668    1.024   1    .311   14.77
Drinking alcoholic                       2.692   1.14    5.494   1    .019   14.77
Operating heart injury                   .500    .582    .739    1    .390   1.65
Level of education                       -.115   .169    .459    1    .498   .892
Smoking                                  .140    .017    65.05   1    .00    6.69
Constant                                 4.55    3.349   1.849   1    .174   .011

The coefficient estimate B in Table (5) summarizes the effect of each predictor. The Wald statistic equals the square of the ratio of the coefficient to its standard error. The standard interpretation of a logistic regression coefficient is that, for a one-unit change in the predictor variable, the log-odds of the outcome are expected to change by the coefficient estimate, given that the other variables in the model are held constant.

In the table, the Wald statistic for each independent variable tests whether that variable has a significant effect on cardiovascular disease status. For renal disease, for example, the effect is significant and large (Wald = 5.806, df = 1, p = .016). "Sig." is the p-value of the significance test of the coefficient. Coefficients with p-values below 0.05 are usually considered significant. Based on our output, 3 explanatory variables are significant. The main factor leading to CVD is drinking alcohol, with an odds ratio of 14.8, followed by renal disease with an odds ratio of 7.17 and smoking with an odds ratio of 6.7.

The fitted equation is:

$$\ln\left[\frac{p(x)}{1 - p(x)}\right] = 4.55 + 1.971x_1 + 2.692x_2 + 0.140x_3$$

where $x_1$ represents renal disease, $x_2$ represents drinking alcohol, and $x_3$ represents smoking.
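To show how such an equation turns predictor values into a predicted probability, the sketch below evaluates the log-odds and inverts the logit. The coefficients are copied as printed in the equation above; the 0/1 coding of the predictors and the example subject are assumptions made only for illustration.

```python
import math

# Coefficients as printed in the fitted equation above.
INTERCEPT = 4.55
B_RENAL, B_ALCOHOL, B_SMOKING = 1.971, 2.692, 0.140

def log_odds(renal: float, alcohol: float, smoking: float) -> float:
    """Linear predictor ln[p / (1 - p)]."""
    return INTERCEPT + B_RENAL * renal + B_ALCOHOL * alcohol + B_SMOKING * smoking

def probability(renal: float, alcohol: float, smoking: float) -> float:
    """Invert the logit: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds(renal, alcohol, smoking)))

# Hypothetical subject with renal disease only (0/1 coding is an assumption).
print(round(log_odds(1, 0, 0), 3))
print(probability(1, 0, 0))
```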

The "S.E." values are the standard errors of the coefficients. The standard error is used to test whether a parameter is significantly different from 0. Standard errors are also used in the calculation of the Wald statistic.
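The relationship between B, S.E., the Wald statistic, and Exp(B) can be checked on a single row of the table. The sketch below uses the renal disease row; the function names are illustrative.

```python
import math

def wald(b: float, se: float) -> float:
    """Wald statistic: squared ratio of a coefficient to its standard error."""
    return (b / se) ** 2

def odds_ratio(b: float) -> float:
    """Exp(B): multiplicative change in the odds per unit change in the predictor."""
    return math.exp(b)

# Renal disease row of Table 5: B = 1.971, S.E. = .818.
print(round(wald(1.971, 0.818), 3))   # Table 5 reports 5.806
print(round(odds_ratio(1.971), 3))    # Table 5 reports 7.177
```

Small differences in the last decimal place are expected, because the table values were computed from unrounded coefficients.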

Looking again at the previous table, the coefficient for level of education is negative, which means that the higher a person's level of education, the less likely they are to have cardiovascular disease.

ROC curve:

Table (6) below shows that the area under the ROC curve is 0.984, with a 95% confidence interval of (0.976, 0.991). This area is significantly different from 0.5, since the p-value is 0.000, meaning that the logistic regression classifies the groups significantly better than chance.

TABLE 6. Area Under the Curve
Test result variable: Predicted probability

Area    Std. Error   Asymptotic Sig.   Asymptotic 95% confidence interval (lower, upper)
0.984   0.004        0.000             0.976, 0.991

Fig. 2. ROC curve
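The area under the ROC curve can be read as the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen control. The sketch below computes it that way on made-up data; it is not the SPSS procedure used in the study, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 1 = heart patient, scores = predicted probabilities.
y = [0, 0, 1, 1, 1]
p = [0.10, 0.60, 0.55, 0.80, 0.90]
print(roc_auc(y, p))  # 5 of the 6 case-control pairs are ordered correctly
```

An area of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.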

B. Results of the ANN

The neural network was designed in five steps: (1) collecting the data, (2) preprocessing the data, (3) building the network, (4) training the network, and (5) testing the model's performance.

Table (7) displays information about the network, such as the hidden layer, the output layer, and the activation and error functions applied. The table shows that there are 9 factors and that the number of units in the input layer is 44. The output layer has 2 units, representing the two categories of the dependent variable: heart patients and non-heart patients. The architecture includes one hidden layer with 7 units. The output activation function is softmax, which generalizes the logistic (sigmoid) function to more than two classes and is equivalent to it when there are only two.

TABLE 7. Network Information

Input layer     Factors                           1  Renal disease
                                                  2  Using oil cooking
                                                  3  Number of meals that have been eaten with oil per day
                                                  4  Passive smoking
                                                  5  Drinking alcoholic
                                                  6  Operated heart injury
                                                  7  Level of education
                                                  8  Smoking
                                                  9  Symptoms
                Number of units                   44
Hidden layers   Number of hidden layers           1
                Number of units in hidden layer   7
Output layer    Activation function               Softmax
                Error function                    Cross-entropy
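To make the roles of the softmax activation and the cross-entropy error concrete, the sketch below runs one forward pass through a network with the reported layer sizes (44 inputs, 7 hidden units, 2 outputs). The weights are random, the inputs are made up, and the tanh hidden activation is an assumption, since the paper does not state the hidden-layer activation; this is not the trained SPSS network.

```python
import math
import random

def softmax(z):
    """Turn raw output scores into probabilities that sum to 1."""
    m = max(z)                                # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Cross-entropy error for one observation with a known class index."""
    return -math.log(probs[true_class])

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One pass through a network with a single hidden layer."""
    # tanh in the hidden layer is an assumption; the paper does not state it.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    return softmax(scores)

rng = random.Random(0)                        # fixed seed for repeatability
n_in, n_hidden, n_out = 44, 7, 2              # sizes reported for the network
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
x = [rng.choice([0.0, 1.0]) for _ in range(n_in)]   # made-up indicator inputs

probs = forward(x, w_hidden, [0.0] * n_hidden, w_out, [0.0] * n_out)
print(probs, cross_entropy(probs, true_class=1))
```

With two output units, the first softmax probability equals the logistic function applied to the difference between the two scores, which is why the two-class case behaves like a sigmoid output.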

Table 8 summarizes the model. It shows the cross-entropy error and the percentage of incorrect predictions for the training and testing samples. The cross-entropy error is 0.043 in the training sample and 0.027 in the testing sample.

TABLE 8. ANN Model Summary

Training   Cross-entropy error             .043
           Percent incorrect predictions   0.0%
           Stopping rule used              Maximum number of epochs (663) exceeded
           Training time                   0:00:02.27
Testing    Cross-entropy error             .027
           Percent incorrect predictions   0.0%

The previous table shows that the misclassification rate was 0.0% in both the training sample and the test sample, which indicates that the network was trained well and classifies new samples accurately.

TABLE 9. Classification Results

                                              Predicted
Sample     Observed            Non-heart patient   Heart patient   Percent correct
Training   Non-heart patient   36                  0               100.0%
           Heart patient       0                   512             100.0%
           Overall percent     6.6%                93.4%           100.0%
Testing    Non-heart patient   14                  0               100.0%
           Heart patient       0                   236             100.0%
           Overall percent     5.6%                94.4%           100.0%

The percentage of incorrect predictions is taken from the classification table. As Table 9 shows, the overall percentage of correct predictions is 100%, so the percentage of incorrect predictions is 0% (100% - 100%); the correct and incorrect percentages always sum to 100%. To read the table, start with the rows. In the training sample there are 36 non-heart patients, all of whom are correctly predicted as non-heart patients, so the percentage correct for that row is 100%. In the next row there are 512 heart patients, all of whom are correctly predicted as heart patients, so the percentage correct is again 100%. In the last row of the training section, the overall percentages (6.6% + 93.4% = 100%) give the share of analysed cases in each category.

Table (10) below shows that the area under the curve is 1.0.

TABLE 10. Area under the ROC Curve for the Neural Network Model

                                        Area
Pathological case   Non-heart patient   1.0
                    Heart patient       1.0

Table (11) shows the importance and normalized importance of each predictor variable. As the table shows, smoking has the highest normalized importance, followed by drinking alcohol and then renal disease.

TABLE 11. Independent Variable Importance

Variable                                 Importance   Normalized importance
Renal disease                            .073         11.8%
Using oil                                .019         3.1%
The number of foods consumed with oil    .067         10.8%
Passive smoking                          .020         3.3%
Drinking alcoholic                       .076         12.2%
Operated heart injury                    .045         7.3%
Level of education                       .050         8.1%
Symptoms                                 .031         5.0%
Smoking                                  .618         100.0%
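The normalized importance column appears to be each importance value divided by the largest one and expressed as a percentage. The sketch below recomputes it from the listed importance values under that assumption; the dictionary keys are shortened labels chosen for the example.

```python
# Importance values as listed in Table 11.
importance = {
    "renal disease": 0.073,
    "using oil": 0.019,
    "number of foods consumed with oil": 0.067,
    "passive smoking": 0.020,
    "drinking alcoholic": 0.076,
    "operated heart injury": 0.045,
    "level of education": 0.050,
    "symptoms": 0.031,
    "smoking": 0.618,
}

largest = max(importance.values())
normalized = {name: 100.0 * value / largest for name, value in importance.items()}

for name, pct in sorted(normalized.items(), key=lambda item: -item[1]):
    print(f"{name:35s} {pct:6.1f}%")
```

The recomputed percentages match the table to within about 0.1 percentage point; the small differences come from the importance values being rounded to three decimals.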

The following table summarizes the results, similarities, and differences between the two models. The logistic regression results indicated that the main risk factors for cardiovascular disease are drinking alcohol, followed by renal disease and then smoking, whereas the neural network indicated that the main risk factor is smoking, followed by drinking alcohol and then renal disease.

TABLE 12. Comparison between the Two Models

Item: Assumptions and conditions of the dependent variable
    Both models: Binary nominal variable

Item: Assumptions and conditions of the independent variables
    Neural network: None
    Logistic regression: A mixture of quantitative and qualitative variables

Item: The importance of independent variables for classification
    Neural network: Smoking, followed by drinking alcoholic and then renal disease
    Logistic regression: Drinking alcoholic, followed by renal disease and then smoking

Item: Area under the ROC curve
    Neural network: 1.0
    Logistic regression: .98

Item: Correct prediction ratio
    Neural network: 100%
    Logistic regression: 95.9%

Item: Estimated final form coefficients
    Neural network: Shows the relative importance of the independent variables automatically
    Logistic regression: Gives the signs of the coefficients, which reflect the relationship between the dependent variable and the independent variables

RECOMMENDATIONS

• Develop a database for gathering statistical data at the Ministry of Health, so that real and accurate data are available, results are sound, and the field can be developed toward the desired goals.

• Make use of statistical methods in all fields of knowledge, and extend the application of both neural networks and the logistic model to all fields of knowledge.

• Pay attention to all factors that increase the risk of CVD, whether pathological, behavioural, or otherwise.

CONCLUSION

The study aimed to identify the most influential risk factors for cardiovascular disease, to compare the binary logistic model with the neural network model, and to determine which of the two approaches is better for processing such data. The main recommendations are to make use of statistical methods in all fields of knowledge and to extend the application of both neural networks and the logistic model to all fields of knowledge.

REFERENCES

[1] World Health Organization, Cardiovascular disease. Geneva, Switzerland: World Health Organization, 2013. Retrieved from http://www.who.int/cardiovascular_diseases/en/; http://www.who.int/healthinfo/global_burden_disease/GBD_report2004updatepart2.pdf

[2] Lozano, R., Naghavi, M., Foreman, K., Lim, S., et al., Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: a systematic analysis for the Global Burden of Disease Study 2010. The Lancet. [Online] 380 (9859), 2095-2128, 2012. Available from: doi:10.1016/S0140-6736(12)61728-0.

[3] Luke, A., Are we facing a noncommunicable disease pandemic? - ScienceDirect. [Online]. 2017. Available from: https://www.sciencedirect.com/science/article/pii/S2210600616301009 [Accessed: 25 April 2018].

[4] Chen, Yung-Yao; Lin, Yu-Hsiu; Kung, Chia-Ching; Chung, Ming-Han; Yen, I.-Hsuan (January 2019). "Design and Implementation of Cloud Analytics-Assisted Smart Power Meters Considering Advanced Artificial Intelligence as Edge Analytics in Demand-Side Management for Smart Homes". Sensors. 19 (9): 2047. doi:10.3390/s19092047. PMC 6539684. PMID 31052502.

[5] El-Shahat A., Artificial Neural Network (ANN): Smart & Energy Systems Applications. Germany: Scholar Press Publishing; 2014. ISBN: 978-3-639-71114-1

[6] Eberhardt, L. L., & Breiwick, J. M., 2012, Models for population growth curves. ISRN Ecology, 2012, 1-7. http://dx.doi.org/doi:10.5402/2012/815016

[7] Aravind Akella, Vibhor Kaushik, Machine Learning Algorithms for Predicting Coronary Artery Disease: Efforts Toward an Open Source Solution; doi: https://doi.org/10.1101/2020.02.13.948414

[8] Simon Nusinovici, Yih Chung Tham, Marco Yu Chak Yan, Daniel Shu Wei Ting, Jialiang Li, Charumathi Sabanayagam, Tien Yin Wong, Ching-Yu Cheng, Logistic regression was as good as machine learning for predicting major chronic diseases, Journal of Clinical Epidemiology, Volume 122, 2020, ISSN 0895-4356, https://doi.org/10.1016/j.jclinepi.2020.03.002.

[9] A. Wang, N. An, Y. Xia, L. Li and G. Chen, "A Logistic Regression and Artificial Neural Network-Based Approach for Chronic Disease Prediction: A Case Study of Hypertension," 2014 IEEE International Conference on Internet of Things (iThings), and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom), Taipei, 2014, pp. 45-52, doi: 10.1109/iThings.2014.16

[10] K. Uma Maheswari, Ms. J. Jasmine, 2017, Neural Network based Heart Disease Prediction, International Journal of Engineering Research & Technology (IJERT), RTICCT - 2017 (Volume 5 - Issue 17)
