Statistic MRA

Review Parameters: Model Building & Interpretation and Model Tuning

1. Model Building

a. Assessments and Rationale of Various Models Employed to Predict Loan Defaults

Altman (1968) developed the Z-score model to predict bankruptcy. The model forecasts the likelihood that an organization will fall into bankruptcy within two years, and it has also been instrumental in predicting corporate defaults. It uses data from a firm's income statement and balance sheet to gauge the firm's financial soundness. The Z-score is a linear combination of five common financial ratios, each weighted by a coefficient. Altman estimated it by applying discriminant analysis to a data set of publicly listed manufacturers. Alexander (2012) made use of symmetric binary choice models, otherwise referred to as conditional probability models. The study sought to establish whether asymmetric binary choice models based on extreme value theory explain bankruptcy better.

In their study of probability-of-default models for Russian banks, Anatoly et al. (2014) used binary choice models to predict the likelihood of default. The study established that prior expert clustering or automatic clustering improves the predictive power of the models. Rajan et al. (2010) examined statistical default models and incentives. They argued that purely statistical models ignore the fact that a change in the incentives of the agents who produce the data may alter the very nature of the data. The study set out to appraise statistical models that rely naively on historical figures without modeling the behavior of the agents that generate these data. Goodhart (2011) sought to assess the likelihood that small businesses default on loans. Using data on a business loan portfolio, the study identified the particular lender, loan, and borrower characteristics, as well as changes in the economic environment, that raise the probability of default. The results of the study form the basis for a scoring model. Focusing on modeling default probability, Singhee & Rutenbar (2010) defined risk as the uncertainty surrounding an enterprise's capacity to service its obligations and debts.

Using a logistic model to forecast the probability of bank loan defaults, Adam et al. (2012) employed a data set with demographic information on borrowers. The authors sought to establish which borrower risk factors are attributable to default. The identified risk factors included marital status, gender, occupation, age, and loan duration. Calabrese (2012) employed three established data mining algorithms (naïve Bayesian classifiers, artificial neural networks, and decision trees) together with a logistic regression model to build a prediction model on a large data set. The study concluded that the naïve Bayesian classifier was superior, with a prediction accuracy of 92.4 percent.

Focusing on modeling SME loan defaults as rare events, Rafaella (n.d.) employed generalized extreme value (GEV) regression. The study concluded that logistic models have drawbacks, such as underestimating the likelihood of loan default. A binary GEV model was used to predict the probability of loan default and was found to perform better than the logistic regression model.

b. Model Building Problem and Variable Selection

An analyst dealing with practical problems invariably faces a wide range of candidate regressors, of which only a small number are likely to be important. Determining the appropriate subset of regressors for the model is referred to as the variable selection problem. Normally, two conflicting goals are involved in building a regression model that includes a subset of the available regressors. On one hand, the model should contain as many regressors as possible so that the information in these factors can influence the predicted value (y). On the other hand, the model should include as few regressors as possible, since the variance of the prediction increases as the number of regressors grows.

Deleting variables potentially introduces bias into the estimates of the coefficients of the retained variables and of the responses. Over-fitting, that is, including in the model variables whose regression coefficients are truly zero in the population, does not introduce bias when estimating the population regression coefficients, provided the usual regression assumptions hold. Nonetheless, one must make sure that over-fitting does not introduce harmful collinearity.

The variable selection process involves the following fundamental steps:

i. Specifying the maximum model to be considered

ii. Specifying the model selection criterion

iii. Specifying the strategy for variable selection

iv. Carrying out the specified analysis

v. Assessing the validity of the chosen model

I. Indicating the Maximum Model

The maximum model is the largest model, that is, the one with the greatest number of predictor variables, considered at any point in the model selection process. The choice of maximum model is constrained by the particular data sample to be analyzed. The most basic constraint is that the error degrees of freedom must be positive. Consequently, $n - k - 1 > 0$ or, equivalently, $n > k + 1$, with $n$ being the number of observations and $k$ the number of predictors, which gives $k + 1$ regression coefficients including the intercept. In general, it is desirable to have large error degrees of freedom. This implies that the smaller the sample size, the smaller the maximum model should be. The challenge then is establishing how many error degrees of freedom are required. The weakest requirement is $n - k - 1 \geq 10$. A common rule of thumb for regression is to have no fewer than 5 or 10 observations for each predictor. In this case, $n \geq 5k$ or $n \geq 10k$.

II. Criteria for Assessment of Subset Regression Models

Various criteria can be used to assess subset regression models. The criterion used depends on the purpose of the model. In what follows, $SS_R(p)$ and $SS_{Res}(p)$ denote the regression sum of squares and the residual sum of squares, respectively, for a model with $p$ terms, that is, $p - 1$ regressors plus an intercept term.

a. F-Test Statistic

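This is a practical criterion for identifying the best model; it compares a reduced model with the full model. With $n$ observations, $k$ candidate regressors in the full model (so $k + 1$ terms), and $p$ terms in the reduced model, the F-test statistic is obtained as below:

$$F_p = \frac{[SS_{Res}(p) - SS_{Res}(k+1)] / (k + 1 - p)}{SS_{Res}(k+1) / (n - k - 1)}$$

This statistic can be compared to an F-distribution with $k + 1 - p$ and $n - k - 1$ degrees of freedom. If $F_p$ is not significant, the smaller model with fewer variables is used.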

b. Coefficient of Determination

This is the most widely used measure of the adequacy of a regression model. Letting $R_p^2$ represent the coefficient of determination for a subset model with $p$ terms, then

$$R_p^2 = \frac{SS_R(p)}{SS_T} = 1 - \frac{SS_{Res}(p)}{SS_T}$$

where $SS_T$ is the total sum of squares. $R_p^2$ increases as $p$ increases and reaches its maximum when $p = k + 1$, with $k$ the number of candidate regressors. Accordingly, one applies this criterion by adding regressors to the model until an additional variable produces only a small increase in $R_p^2$.

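If we consider $R_0^2 = 1 - (1 - R_{k+1}^2)(1 + d_{\alpha,n,k})$, in which $d_{\alpha,n,k} = k F_{\alpha,k,n-k-1} / (n - k - 1)$ and the value of $R^2$ for the full model with $k$ regressors is $R_{k+1}^2$, any subset of regressor variables that produces an $R^2$ larger than $R_0^2$ is referred to as an $R^2$-adequate ($\alpha$) subset. This implies that its $R^2$ is not significantly different from $R_{k+1}^2$.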

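c. Mallows' Cp Statistic

Mallows' $C_p$ statistic is another candidate selection criterion. In this case

$$C_p = \frac{SS_{Res}(p)}{\hat{\sigma}^2} - n + 2p$$

where $SS_{Res}(p)$ is the residual sum of squares of a subset model with $p$ terms, $\hat{\sigma}^2$ is the residual mean square of the full model, and $n$ is the number of observations. The criterion helps in deciding how many variables to include in the best model, since it takes a value of approximately $p$ if the residual mean square of the subset model, $SS_{Res}(p)/(n - p)$, is approximately equal to $\hat{\sigma}^2$.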

d. PRESS

An analyst can choose the subset regression model with a small value of PRESS, the prediction error sum of squares. Although PRESS has intuitive appeal, particularly for the prediction problem, it is not a simple function of the residual sum of squares, and developing a variable selection algorithm based on this criterion is not straightforward. The statistic is nonetheless potentially useful for discriminating between alternative models.
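The criteria in this section can be computed side by side for any candidate subset. The following is a minimal sketch, not the study's own procedure: it assumes ordinary least squares with an intercept, uses the full model's residual mean square as the variance estimate in Mallows' statistic, and obtains PRESS from the leave-one-out residuals e_i / (1 - h_ii). The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (design matrix, residuals)."""
    A = np.column_stack([np.ones(len(y)), X])   # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A, y - A @ beta

def subset_criteria(X, y, cols):
    """R^2, Mallows' C_p and PRESS for the subset model using columns `cols`."""
    n = len(y)
    ss_total = float(np.sum((y - y.mean()) ** 2))
    # Assumption: sigma^2 is estimated by the residual mean square of the full model.
    A_full, res_full = fit_ols(X, y)
    sigma2 = float(res_full @ res_full) / (n - A_full.shape[1])
    A, res = fit_ols(X[:, cols], y)
    p = A.shape[1]                               # number of terms incl. intercept
    ss_res = float(res @ res)
    # Diagonal of the hat matrix H = A (A'A)^-1 A'.
    hat_diag = np.einsum("ij,ij->i", A @ np.linalg.pinv(A.T @ A), A)
    press = float(np.sum((res / (1.0 - hat_diag)) ** 2))
    return {"p": p, "r2": 1.0 - ss_res / ss_total,
            "cp": ss_res / sigma2 - n + 2 * p, "press": press}

# Synthetic illustration: only the first two of four regressors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=60)
full = subset_criteria(X, y, [0, 1, 2, 3])
sub = subset_criteria(X, y, [0, 1])
print(full, sub)
```

For the full model the C_p value equals the number of terms by construction, so the statistic is only informative for proper subsets.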

e. Logistic Regression Model

This model is typically applied when the dependent variable is categorical and takes only two possible outcomes, success or failure. The logistic regression takes the following form:

$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

Taking the antilog of the above equation gives an expression for the probability that an event occurs:

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}$$

Here $p$ represents the probability of the desired outcome or event. This model will be instrumental in forecasting the probability of loan default.
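The step from the log-odds to a probability can be illustrated with a short sketch. The coefficients below are made-up values chosen only for illustration; they are not estimates from this study, and the function names are hypothetical.

```python
import math

def event_probability(intercept, coefs, x):
    """Probability of the event given a linear predictor on the log-odds scale."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))   # same as e^z / (1 + e^z)

def log_odds(p):
    """Logit transform: the inverse of the function above."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for two regressors (illustration only).
p = event_probability(-1.0, [0.8, -0.5], [2.0, 1.0])
print(round(p, 4), round(log_odds(p), 4))
```

Whatever the values of the regressors, the result stays strictly between 0 and 1, which is what makes the form suitable for modeling a probability.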

f. Linear Regression Model

This model uses a statistical approach in which the target value is expressed as a linear combination of a set of explanatory variables. When the linear regression uses a single independent variable, it is referred to as simple linear regression. The notation for a simple linear regression is as shown below:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Where

$y$ = Dependent variable

$\beta_0$ = Intercept parameter

$\beta_1$ = Coefficient of regression (slope parameter)

$\varepsilon$ = Error term

$x$ = Independent variable

III. Specifying The Strategy For Variables Selection

a. All Possible Regressions Procedure

This procedure calls for fitting every possible regression equation, one for each possible combination of the independent variables. Assuming that the intercept term $\beta_0$ is included in all equations, if the number of candidate regressors is $k$, the total number of equations to be estimated and evaluated is $2^k$. Consequently, the number of equations to be evaluated rises rapidly as the number of candidate regressors increases.
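To make the growth in the number of equations concrete, the sketch below enumerates every subset of a small set of candidate regressors and fits each one by least squares. It is an illustration on synthetic data under the assumption that the intercept is always included; the helper names are hypothetical.

```python
import itertools
import numpy as np

def residual_ss(X, y, cols):
    """Residual sum of squares of the OLS fit on an intercept plus columns `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def all_possible_regressions(X, y):
    """Fit every subset of regressors; returns a list of (subset, residual SS)."""
    k = X.shape[1]
    fits = []
    for size in range(k + 1):                      # size 0 is the intercept-only model
        for cols in itertools.combinations(range(k), size):
            fits.append((cols, residual_ss(X, y, cols)))
    return fits

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=50)
fits = all_possible_regressions(X, y)
print(len(fits))   # 2**4 = 16 equations for 4 candidate regressors
```

With 20 candidate regressors the same loop would fit more than a million equations, which is why the selection strategies that follow examine only a fraction of the subsets.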

b. Backward Elimination Procedure

This procedure starts with a model that includes all $k$ candidate regressors. The partial F-statistic is then calculated for every regressor as if it were the last variable to enter the model. The smallest partial F-statistic is compared with a preselected value, FOUT. If the smallest partial F value is less than FOUT, that regressor is removed from the model. A model with $k - 1$ regressors is then fitted, and the partial F-statistics for this new model are computed. The procedure is repeated until the smallest partial F value is greater than or equal to the preselected cutoff value FOUT.

c. Forward Selection Procedure

This process starts by assuming that there are no regressors in the model other than the intercept. An attempt is made to find the best subset by inserting regressors into the model one at a time. At every stage, the regressor with the largest partial correlation with the response, or equivalently the largest partial F-statistic given the other regressors already in the model, is added to the model if its partial F-statistic exceeds the preselected entry threshold FIN.

d. Stepwise Regression Procedure

Stepwise regression is a modification of forward selection that allows the variables entered in earlier steps to be reassessed at every stage. A variable entered earlier may become redundant at later steps because of its relationship with other variables subsequently added to the model.
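The forward and backward procedures can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the partial F-statistic for one regressor has a single numerator degree of freedom, and the cutoffs FIN = FOUT = 4.0 are arbitrary example values rather than values taken from this study. All function names are hypothetical.

```python
import numpy as np

def ss_res(X, y, cols):
    """Residual sum of squares for an intercept plus the columns in `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def partial_f(X, y, cols, j):
    """Partial F for regressor j, treated as the last variable to enter `cols`."""
    others = [c for c in cols if c != j]
    full = ss_res(X, y, cols)
    mse = full / (len(y) - len(cols) - 1)
    return (ss_res(X, y, others) - full) / mse

def forward_selection(X, y, f_in=4.0):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        scores = {j: partial_f(X, y, selected + [j], j) for j in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= f_in:        # no candidate clears the entry threshold
            break
        selected.append(best)
        remaining.remove(best)
    return selected

def backward_elimination(X, y, f_out=4.0):
    selected = list(range(X.shape[1]))
    while selected:
        scores = {j: partial_f(X, y, selected, j) for j in selected}
        worst = min(scores, key=scores.get)
        if scores[worst] >= f_out:      # every remaining regressor is retained
            break
        selected.remove(worst)
    return selected

# Synthetic data: only columns 0 and 2 carry signal.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
forward = forward_selection(X, y)
backward = backward_elimination(X, y)
print(sorted(forward), sorted(backward))
```

A stepwise variant would add one step to the forward loop: after each entry, recompute the partial F-statistics of the variables already selected and drop any that fall below FOUT.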

2. Model Performance Evaluation

Regression models predict continuous values by learning from a number of independent variables. A classification or regression pipeline involves three basic steps. First, an initial configuration of the model is built and the output is predicted from a given input. Second, the predicted value is compared with a target value and a measure of model performance is computed. Third, the various model parameters are iteratively fine-tuned in an effort to reach the optimal value of the performance metric. The effort and tasks needed to reach that optimal value differ from one performance metric to another. Regression deals with predicting the value of the outcome variable at a given time with the help of other correlated independent variables. Unlike the classification task, the regression task produces outputs that are continuous in value within a given range.

a. Prediction

Prediction models deal with ratio or interval dependent variables, while classification models involve categorical dependent variables, either ordinal or nominal. For loan default prediction models, the ratio dependent variables include customer revenue, customer acquisition cost, return on investment, and response time. Prediction models make use of regression, neural network, and decision tree methods. Outlined below are some of the evaluation methods for prediction models.

i. MAE/ MAD

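MAE, or MAD, refers to the mean absolute error or deviation, which is obtained through the following expression:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |e_i|$$

where $e_i = y_i - \hat{y}_i$ is the difference between the actual and the predicted value for observation $i$, and $n$ is the number of observations.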

ii. Average Error

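This value is comparable to MAD except that it keeps the sign of the error, so that positive errors cancel out negative errors of similar magnitude. The average error indicates whether the predictions are, on average, under-predicting or over-predicting the response. The average error is obtained as follows:

$$\text{Average error} = \frac{1}{n} \sum_{i=1}^{n} e_i$$

where $e_i = y_i - \hat{y}_i$ is the difference between the actual and the predicted value for observation $i$, and $n$ is the number of observations.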

iii. MAPE

MAPE stands for Mean Absolute Percentage Error, a measure of how far, in percentage terms, predictions deviate from the actual values.

iv. RMSE

The root-mean-squared error (RMSE) is similar to the standard error of prediction, except that it is calculated on the validation data as opposed to the training data. It has the same units as the predicted variable.

v. Total SSE

Total SSE is the total sum of the squared errors.
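The five error measures above can be computed together from paired actual and predicted values. The sketch below assumes the error is defined as actual minus predicted, so a positive average error means the model under-predicts on average, and that MAPE divides each error by the actual value. The sample figures are invented for illustration.

```python
import math

def prediction_errors(actual, predicted):
    """MAE, average error, MAPE (in percent), RMSE and total SSE."""
    errors = [a - p for a, p in zip(actual, predicted)]   # assumed sign convention
    n = len(errors)
    sse = sum(e * e for e in errors)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "average_error": sum(errors) / n,
        "mape": 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "rmse": math.sqrt(sse / n),
        "sse": sse,
    }

# Invented validation values (illustration only).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 380.0]
metrics = prediction_errors(actual, predicted)
print(metrics)
```

In this example the MAE is 12.5 while the average error is only 2.5, which shows how errors of opposite sign cancel in the average error but not in the MAE.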

vi. Area Under the ROC Curve (AUC- ROC)

The ROC curve is one of the popular metrics used in industry; its biggest advantage is that it is independent of changes in the proportion of responders. For a probabilistic model, each classification threshold gives a different sensitivity and a correspondingly different specificity, with the two varying as shown:

[Figure: specificity and sensitivity]

The plot of sensitivity against (1 - specificity) is the ROC curve. (1 - specificity) is also referred to as the false positive rate, while sensitivity is also referred to as the true positive rate. For the current case the ROC curve is as follows:

[Figure: model evaluation, ROC curve]

A model that gives a class as output is represented by a single point in the ROC plot. Such models cannot be compared with each other, since a judgment needs to be made on a single metric rather than on multiple metrics. For instance, a model with parameters (0.2, 0.8) and a model with parameters (0.8, 0.2) can result from the same underlying model, hence these metrics should not be compared directly.

In the case of a probabilistic model, we are fortunate to get a single number, the AUC-ROC. However, the whole curve still needs to be examined to reach a conclusive decision. It is also possible for one model to perform better in one region and another model to perform better in another. The ROC curve is almost independent of the response rate, since its two axes both come from column-wise calculations on the confusion matrix. For both the x and y axes, the numerator and denominator change on a similar scale when the response rate shifts.
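The construction of the curve and of the AUC-ROC can be sketched directly from scored observations. The code below is a minimal illustration with made-up labels and scores: each distinct score is used as a threshold, and the area is obtained with the trapezoidal rule. It assumes both classes are present in the data; the function names are hypothetical.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up example: 1 = responder, 0 = non-responder.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
print(curve, area)
```

Here the area is 8/9, which matches the share of (responder, non-responder) pairs in which the responder receives the higher score.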

b. Classification

i. Confusion Matrix

A confusion matrix is a square N x N matrix, where N is the number of classes being predicted. Accuracy is the proportion of all predictions that are correct. Precision, also referred to as Positive Predictive Value, is the proportion of positive predictions that are correct. Negative Predictive Value, on the other hand, is the proportion of negative predictions that are correct. Specificity is the proportion of actual negative cases that are correctly identified. Below is an illustration of a confusion matrix:

Confusion Matrix

| | Target: Positive | Target: Negative | |
| Model: Positive | a | b | Positive Predictive Value = a/(a+b) |
| Model: Negative | c | d | Negative Predictive Value = d/(c+d) |
| | Sensitivity = a/(a+c) | Specificity = d/(b+d) | Accuracy = (a+d)/(a+b+c+d) |

ii. Sensitivity

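Using the cell labels of the confusion matrix (a = true positives, c = false negatives), sensitivity is obtained through the expression below:

$$\text{Sensitivity} = \frac{a}{a + c}$$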

iii. Specificity

Also computed from the confusion matrix (b = false positives, d = true negatives), the expression for specificity is as shown:

$$\text{Specificity} = \frac{d}{b + d}$$
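All the confusion matrix measures follow from the four cell counts. The sketch below uses the cell labels a, b, c, and d from the illustration above, with invented counts that are not taken from this study's data; the function name is hypothetical.

```python
def confusion_metrics(a, b, c, d):
    """a = true positives, b = false positives, c = false negatives, d = true negatives."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "positive_predictive_value": a / (a + b),
        "negative_predictive_value": d / (c + d),
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
    }

# Invented counts (illustration only).
m = confusion_metrics(a=40, b=10, c=5, d=45)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Reading the measures together matters: a model can score well on sensitivity while doing poorly on specificity, so no single cell ratio summarizes the matrix.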

3. Best Model Interpretation

Bank loan defaults are rare events, but when they occur they may result in losses. Such extreme events can disrupt banks' daily operations and, in turn, harm a country's economy. Statisticians and analysts have long focused on this concern, which has led to the proposal of various models to address the problem. Popular models for loan defaults include the standard discriminant model, the Z-score, and logistic regression models. This study prefers the logistic regression model for assessing bank loan defaults. Logistic regression has been instrumental in credit risk evaluation in the financial sector. Its primary benefits are that it is easy to understand, easy to implement, and performs well (Gilli & Këllezi, 2000). Additionally, the model is preferable to linear regression because it avoids several problems. For instance, linear regression can produce an output that is negative or greater than 1, which cannot be interpreted as a probability (Goodhart, 2011). Logistic regression deals with this issue by constraining the regression output to a continuous range of values between 0 and 1.

Previous studies have proposed models that predict which customers are likely to default on bank loans using two different classifiers, the Cox proportional hazards algorithm and logistic regression. This study relies on logistic regression coupled with a random forest classifier to predict the likelihood of loan default.

a. Logistic Regression Model

The logistic regression (LR) model is a predictive technique widely used in forecasting and classification problems. In this model the target variable is a non-linear function of the probability of being positive (Thomas, 2000). In addition, the results of LR classification are sensitive to correlations between the independent variables. Consequently, the variables used to build the model should not be strongly correlated. The non-linearity of credit data is assumed to reduce the accuracy of LR. It follows that the primary goal of the LR credit scoring model is to establish the conditional probability of each application belonging to a particular class (Yap et al., 2011). Customers who are likely to default on a loan and those who are not are assessed on the basis of the values of the explanatory variables in the loan application.

It is vital that every loan application be assigned to only one class of the dependent variable. The LR model restricts the predicted values of the dependent outcome variable to the range between 0 and 1. Logistic regression is a popular modeling technique that sorts loan applicants into two classes using a set of predictive variables (Akkoc, 2012). The following expression is the general representation of the LR model.

$$\ln\left(\frac{p}{1 - p}\right) = z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon$$

where:

$p$ represents the likelihood of a customer being “good”, with $z$ being the function of the predictive variables ($x_1$: age, $x_2$: loan amount, $x_3$: loan amount, and $x_4$: professional class) indicating the characteristics of the loan applicant.

$\beta_0$ is the intercept, with $\beta_i$ ($i$ = 1,…,4) indicating the coefficients linked to the respective $x_i$ ($i$ = 1,…,4).

$y$ stands for the default occurrence ($y = 1$).

$\varepsilon$ is the error term.

It is pertinent to note that multicollinearity is an unfavorable aspect of the logistic regression model. However, it is not a substantial concern here, since the credit scoring model for loans is formulated for purposes of prediction.

b. Discriminant Analysis (DA)

Discriminant analysis aims to find a discriminant function and to classify items into one of two or more groups on the basis of features describing those items. Its main purpose is to maximize the difference between the two groups while minimizing the differences among members of the same group. In credit risk models, one group consists of good borrowers (non-defaulters, group A) and the other of bad ones (defaulters, group B). The differences are measured by means of the discriminant variable, the score Z. For a given borrower $i$, the score is calculated as follows:

$$Z_i = \sum_{j=1}^{n} \gamma_j x_{ij}$$

where $x_{ij}$ represents a given characteristic, $\gamma_j$ stands for the coefficient within the estimated model, and $n$ denotes the number of indicators.

DA seeks a linear combination of the independent variables. The purpose is to classify the observations into mutually exclusive groups as accurately as possible. This is achieved by maximizing the ratio of between-group to within-group variance. The discriminant function takes the following form:

$$Z = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where $x_j$ denotes the jth independent variable, $\beta_j$ represents the coefficient of the jth independent variable, and Z is the discriminant score that maximizes the difference between the two groups.

In this study, four variables were used as discriminant variables. They were applied to the chosen sample to obtain the fitted discriminant score, which serves as the discriminant criterion for distinguishing between default and non-default borrowers.
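A discriminant score of this kind can be illustrated with Fisher's linear discriminant for two groups, which picks the weights that maximize the ratio of between-group to within-group variation. The sketch below is not the fitted model of this study: the four variables are synthetic, the midpoint cutoff is an assumption, and the function name is hypothetical.

```python
import numpy as np

def fisher_discriminant(X_a, X_b):
    """Weights and cutoff of Fisher's two-group linear discriminant."""
    mean_a, mean_b = X_a.mean(axis=0), X_b.mean(axis=0)
    # Pooled within-group scatter matrix.
    S_w = (X_a - mean_a).T @ (X_a - mean_a) + (X_b - mean_b).T @ (X_b - mean_b)
    weights = np.linalg.solve(S_w, mean_a - mean_b)
    # Assumed cutoff: midpoint between the two group mean scores.
    cutoff = float(weights @ (mean_a + mean_b) / 2.0)
    return weights, cutoff

# Synthetic borrowers with four indicators (illustration only).
rng = np.random.default_rng(3)
group_a = rng.normal(loc=1.0, size=(100, 4))    # stand-in for non-defaulters
group_b = rng.normal(loc=-1.0, size=(100, 4))   # stand-in for defaulters
weights, cutoff = fisher_discriminant(group_a, group_b)

scores_a, scores_b = group_a @ weights, group_b @ weights   # Z scores
accuracy = (np.sum(scores_a > cutoff) + np.sum(scores_b <= cutoff)) / 200
print(weights, cutoff, accuracy)
```

A borrower whose score Z lies above the cutoff is assigned to group A, otherwise to group B.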

c. Significance of the Model and Interpretation of the Coefficients

Logistic regression is practical only for large samples, which makes it essential to check for the absence of multicollinearity among the variables. Given the small number of explanatory variables in our study, however, this issue does not arise. Before interpreting the estimated coefficients, we can assess the quality, or overall significance, of the model using the Cox and Snell R-square, which can be determined with the following formula:

$$R^2_{CS} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n}$$

where $L_0$ is the likelihood of the intercept-only model, $L_M$ is the likelihood of the fitted model, and $n$ is the sample size.

The R-square represents the variance explained by the model. We find that the Cox and Snell R-square equals 0.9592, indicating a well-fitted model.
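In practice the Cox and Snell statistic is computed from log-likelihoods rather than raw likelihoods, since (L0 / LM)^(2/n) equals exp(2 (ln L0 - ln LM) / n). The sketch below uses hypothetical log-likelihood values, not the ones behind the 0.9592 reported above; the function name is also hypothetical.

```python
import math

def cox_snell_r2(loglik_null, loglik_model, n):
    """Cox and Snell R-square from the null and fitted model log-likelihoods."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)

# Hypothetical log-likelihoods for a sample of 600 observations.
r2 = cox_snell_r2(loglik_null=-400.0, loglik_model=-300.0, n=600)
print(round(r2, 4))
```

A fitted model that does no better than the intercept-only model gives a value of 0, and the statistic grows as the fitted likelihood improves.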

The table below presents the model summary.

Model Summary

| Comparison criteria | Values |
| Deviance (dev) | 44.49 |
| Degrees of freedom (df) | 599 |
| Chi-square test | 661.236 |
| Dispersion | 0.39 |

From the table it is evident that the Chi-square value is higher than the deviance (dev); this makes the model globally significant.

d. Analysis of Sensitivity and Predictive Power of Model

In testing, the model shows a specificity of 1.526% and a sensitivity of 99.41%. The rate at which defaults are misclassified as non-defaults is 0.586%. This demonstrates that the model successfully predicts the quality of borrowers. According to the findings, the model correctly classifies 89% of the observations in our sample. Its predictive capacity could be stronger, but the experimental nature of this model should be kept in mind.

The following table serves as an illustration that indicates the model prediction power along with the analysis of sensitivity.

| No. of Observed Borrowers | Default Likelihood > 0.5 (y=1) | Non-Default Likelihood < 0.5 (y=0) | Total |
| Default | 364 | 4 | 368 |
| | Default prob. < 0.5 (y=1) | Non-default prob. > 0.5 (y=0) | |
| Non-default | 2 | 236 | 265 |
| Total | 366 | 267 | 633 |

4. Model Tuning and Validation

a. Relevance-Weighted Ensemble Model for Anomaly Detection

Anomaly detection is instrumental in online data mining. Its main challenge is the dynamically evolving nature of the monitored environment. This poses a problem for conventional anomaly detection techniques for data streams, which assume a relatively static monitoring environment. In an environment that changes intermittently, referred to as switching data streams, static techniques produce high error rates through false positives (Yang et al., 2009). To cope with dynamic environments, there is a need for a system that can learn from the history of normal behavior in data streams, while taking into account that not all past time periods are equally relevant (Aggarwal, 2012). Accordingly, a relevance-weighted ensemble model is proposed for learning the normal behavior of credit rating data, and it forms the foundation of the anomaly detection technique. This approach improves detection accuracy by using relevant history while maintaining computational efficiency. The relevance-weighted ensemble model makes a relevant contribution by applying ensemble approaches to anomaly detection in data streams. Considerable improvements over other state-of-the-art anomaly detection algorithms for data streams can be achieved on both synthetic and real data streams.

b. Model Tuning

Many regression and classification models are highly adaptable in that they are capable of modeling complex relationships. Each model has tuning parameters that govern this adaptability, enabling it to pinpoint predictive patterns and structures within the data. Nonetheless, these tuning parameters can also pick up predictive patterns that are not reproducible. This is referred to as over-fitting. Over-fit models usually have excellent predictivity for the samples on which they were built, but low predictivity for new samples (Steyerberg, 2010).

Most models have important parameters that cannot be directly estimated from the data. For instance, in the K-nearest neighbor classification model, a new sample is predicted based on the K closest data points in the training set. The question is how many neighbors should be used. Choosing too few neighbors may over-fit the individual points of the training set, while choosing too many neighbors may not be sensitive enough to yield reasonable performance (Steyerberg, 2010). This kind of model parameter is called a tuning parameter, since there is no analytical formula available to compute an appropriate value.

Most models contain more than one tuning parameter. Since many of these parameters control model complexity, poor choices for their values can lead to over-fitting. There are various techniques for finding the best parameters. A general approach involves defining a set of candidate values, generating reliable estimates of model utility across the candidate values, and then choosing the optimal settings. Below is a flow chart that illustrates this approach.
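The general approach can be sketched for the K-nearest neighbor example. The code below is an illustration on synthetic two-class data, not the tuning run of this study: it assumes 5-fold cross-validation as the resampling scheme, accuracy as the performance measure, and odd candidate values of K so that votes cannot tie. All names are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)        # share of class-1 neighbors
    return (votes > 0.5).astype(int)

def cv_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean hold-out accuracy over the folds for one candidate value of k."""
    rng = np.random.default_rng(seed)            # same folds for every candidate
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        pred = knn_predict(X[train], y[train], X[held_out], k)
        scores.append(np.mean(pred == y[held_out]))
    return float(np.mean(scores))

def tune_k(X, y, candidates):
    """Performance profile across the candidates and the best-scoring value."""
    profile = {k: cv_accuracy(X, y, k) for k in candidates}
    return max(profile, key=profile.get), profile

# Synthetic two-class data (illustration only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)), rng.normal(2.0, 1.0, size=(60, 2))])
y = np.array([0] * 60 + [1] * 60)
candidates = [1, 3, 5, 7, 9]
best_k, profile = tune_k(X, y, candidates)
print(best_k, profile)
```

Once the best value has been chosen, the final model is refitted on all the training data with that value; for K-nearest neighbors this simply means keeping the full training set and the selected K.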

c. Model Validation

The major benefit of the logistic regression model is the simplicity of interpreting its results through odds ratios (Han et al., 2018). Besides, logistic regression is a sturdier technique for models with binomial outcomes that draw on numerous explanatory variables. Furthermore, since ordinary regression violates the normality assumption when the output variable is categorical, logistic regression deals with this concern by providing a model that expresses the non-linear output in a linear manner within the bounds of 0 and 1. Since lending plays a critical role in global finance, credit scoring is a vital technique for evaluating credit risk. Most previous studies made use of numerous machine learning techniques, including neural networks, decision trees, logistic regression, and support vector machines. Each machine learning algorithm demonstrated accuracy and proved useful in many settings. While many studies emphasized accuracy in predicting loan default, few focused on the consequences of false negatives, which can be considerably costly to lending banks.

Once a candidate set of parameter values has been selected, a reliable estimate of model performance is obtained. The performance on the held-out samples is then aggregated into a performance profile, which is subsequently used to determine the final tuning parameters. The final model is then built on all the training data using the selected tuning parameters. The loan default data can be re-sampled and evaluated numerous times for every tuning parameter value. The resulting values are then aggregated in an effort to obtain the optimal value of K. The technique outlined in the flow chart makes use of a set of candidate models defined by the tuning parameters.

Mitchell (1998) proposed another technique, known as genetic algorithms, while Olsson & Nelson (1975) proposed the simplex search method. Both techniques are useful for finding optimal tuning parameters. These approaches determine appropriate values for the tuning parameters algorithmically, iterating until they reach the parameter setting with the best performance. They tend to evaluate a large number of candidate models and can give better results than a predefined set of tuning parameter values when model performance can be computed efficiently.

References

Aggarwal, C. C. (2012). A Segment-Based Framework for Modeling and Mining Data Streams. Knowledge and Inf. Sys., 30(1), 1–29.

Akkoc, S. (2012). An Empirical Comparison of Conventional Techniques, Neural Networks and the Three Stage Hybrid Adaptive Neuro Fuzzy Inference System (ANFIS) Model for Credit Scoring Analysis: The Case of Turkish Credit Card Data. European Journal of Operational Research, 222(1), 168–178.

Alexander, B. (2012). Determinant of Bank Failures: The Case of Russia. Journal of Applied Statistics, 78(32), 235–403.

Anatoly, B. J. (2014). The Probability of Default Models of Russian Banks. Journal of Institute of Economics in Transition, 21(5), 203–278.

Altman, E. (1968). Financial Ratios, Discriminant Analysis, and Prediction of Corporate Bankruptcy. Journal of Finance, 23(4), 589–609.

Beirlant (2004). Statistics of Extremes. Hoboken, NJ: Wiley.

Calabrese, R. (2012). Modeling SME Loan Defaults as Rare Events: The Generalized Extreme Value Regression. Journal of Applied Statistics, 00(00), 1–17.

Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London: Springer.

Gilli, M., & Këllezi, E. (2000). Extreme Value Theory for Tail-Related Risk Measures. Geneva: FAME.

Goodhart, C. (2011). The Basel Committee on Banking Supervision. Cambridge: Cambridge University Press.

Han, J. T., Choi, J. S., Kim, M. J., & Jeong, J. (2018). Developing a Risk Group Predictive Model for Korean Students Falling Into Bad Debt. Asian Economic Journal, 32(1), 3–14.

Thomas, L. (2000). A Survey of Credit and Behavioral Scoring: Forecasting Financial Risk of Lending to Consumers. Int. J. Forecast., 16(2), 149–172.

Rafaella, C., & Giampiero, M. (n.d.). Bankruptcy Prediction of Small and Medium Enterprises Using a Flexible Binary GEV Extreme Value Model. American Journal of Theoretical and Applied Statistics, 1307(2), 3556–3798.

Singhee, A., & Rutenbar, R. (2010). Extreme Statistics in Nanoscale Memory Design. New York: Springer.

Yap, P., Ong, S., & Husain, N. (2011). Using Data Mining to Improve Assessment of Credit Worthiness via Credit Scoring Models. Exp. Syst. Appl., 38(10), 1374–1383.

Yang, D., Rundensteiner, E. A., & Ward, M. O. (2009). Neighbor-Based Pattern Detection for Windows Over Streaming Data. In Advances in DB Tech. (pp. 529–540). ACM.

Flow chart of the parameter tuning process:

i. Defining Variables for Tuning Parameters

ii. Data Re-sampling, Model Fitting, and Hold-outs Prediction

iii. Aggregating Re-sampling into Performance Profiles

iv. Final Tuning Parameters

v. Applying Final Tuning Parameters and Refitting the Model