Classification Trees and k-NN
Week 2 Individual Assignment 2: Quantitative Analysis of Credit - Solutions
This assignment is based on the data we used during our two live sessions, but it has been updated to include a splitting variable ( credit2.xlsx ). In the spreadsheet under the tab “Data," you will find data
pertaining to 1,000 personal loan accounts. The tab “Data Dictionary” contains a description of what the various variables mean.
As a part of a new credit application, the company collects information about the applicant. The company then decides an amount of the credit extended (the variable CREDIT_EXTENDED). For these 1,000 accounts, we also have information on how profitable each account turned out to be (the variable NPV). A negative value indicates a net loss, and this typically happens when the debtor defaults on his/her payments.
The goal in this assignment is to investigate how one can use this data to better manage the bank's credit extension program. Specifically, our goal is to develop a classification model to classify a new credit account as “profitable” or “not profitable." Secondly we want to compare its performance in the context of decision support to a linear regression model that predicts NPV directly.
Please answer all the questions. Supply supporting documentation and show calculations as
needed. Please submit a single, well-formatted PDF or Word file. The instructor should not need to go searching for your answers! In addition, please upload an Excel file with your model outputs – the file will not be graded, but will help the instructor give you feedback, if your model differs substantially from the solutions.
For extra assistance, you may want to access the tutorials located on the course resource center page.
Data Preparation
The data preparation repeats the steps from the live session:
a) The goal is to predict whether or not a new credit will result in a profitable account. Create a new variable to use as the dependent variable.
b) Create dummy variables for all categorical variables with more than 2 values (or if you prefer, you can sort your variables into numerical and categorical when you run the model).
c) Split the data into 2 parts using the splitting variable that has been added to the data set. This is to ensure a more balanced split between the validation and training samples. Note that Analytic Solver Data Mining only allows 50 columns in the analysis, so leave out your base dummies (if you created them) when partitioning. After the data partition, you should have 666 rows in your training data and 334 in your validation data.
The Assignment
1. Applying Logistic Regression
If one fits a Logistic Regression Model using all the independent variables, one observes a) a gap in the classification performance between the training data and the validation data, and b) very
high p-values for some of the variables. The performance gap between the training and validation may be a sign of overfitting, and the high p-values may be a sign of “useless” variables in the model, or of multicollinearity.
a) Our goal is to classify credit requests into “profitable” and “not profitable." To that end, select to run “forward selection," and set FIN down to 1.5 (this lowers the threshold for a variable to enter the model, resulting in more models to choose from). Select one of the forward selection models based on the principles discussed in the book and/or the tutorials on the course resource center and run it.
Note: Exclude Credit Extended and any other variables not appropriate for the analysis.
Include the model (the variables and the corresponding regression coefficients) as an Exhibit.
|
Predictor |
Estimate |
|
Intercept |
-0.1409 |
|
AGE |
0.0350 |
|
NUM_CREDITS |
-0.3472 |
|
DURATION |
-0.0208 |
|
INSTALL_RATE |
-0.4070 |
|
GUARANTOR |
0.8746 |
|
OTHER_INSTALL |
-0.6841 |
|
OWN_RES |
0.5299 |
|
REAL_ESTATE |
0.4792 |
|
AMOUNT_REQUESTED |
-0.0001 |
|
GENDER_F |
0.3894 |
|
CHK_ACCT_1 |
0.7863 |
|
CHK_ACCT_2 |
1.3594 |
|
CHK_ACCT_3 |
2.1811 |
|
SAV_ACCT_4 |
0.8059 |
|
HISTORY_4 |
0.6811 |
|
PRESENT_RESIDENT_2 |
-0.4176 |
|
EMPLOYMENT_2 |
0.3505 |
|
EMPLOYMENT_3 |
0.7936 |
|
TYPE_2 |
1.9168 |
|
TYPE_3 |
0.5290 |
|
TYPE_4 |
0.6752 |
Please refer to the Excel solutions for additional details.
b) Why did you select this particular model?
From the feature selection output we chose to run the model with 18 coefficients. This model was chosen because it has Cp close to the number of coefficients in the model (and not higher), it has probability above .05 and the improvement in RSS if we expand the model further is relatively small.
c) Based on your model, and setting the cut-off value to 0.5, please provide the following information (based on the validation data):
· The sensitivity of the model: 0.88
· The specificity of the model: 0.495
In other words, at the default cut-off we correctly identify 88% of the profitable customers, but include around 50% of the unprofitable customers.
2. ROC Curves
a) We now want to compare the predictive performance of the model on the training sample and on the validation sample. Create a single figure that compares the ROC curves for both the training sample and the validation sample. Please refer to the ROC tutorials in the resource center as needed for a step-by-step guide for creating an ROC curve. Alternatively, you can combine the two curves that Analytical Solver Data Mining provides into a single plot.
Include a clean figure as an Exhibit.
3. Finding the "best" cut-off
a) Create a data-table to calculate the total NPV (assuming we extend credit to all classified as
“profitable” as a function of the cut-off based on the training data. Select the best cut- off. Include the table as an Exhibit.
|
cut-off |
0 |
0.1 |
0.2 |
0.3 |
0.4 |
0.5 |
0.6 |
0.7 |
0.8 |
0.9 |
1 |
|
NPV training |
-56740 |
-26705 |
-2644 |
18771 |
45734 |
68023 |
88312 |
90200 |
84550 |
63883 |
0 |
|
NPV validation |
-39141 |
-15636 |
-6309 |
13945 |
31128 |
36599 |
51636 |
51623 |
43999 |
32005 |
0 |
Please refer to the solutions for a more detailed table.
b) What is your selected cut-off? 0.725 (based on a more detailed table in the Excel file)
c) Create the same table for the validation data. Include the table as an Exhibit.
Refer to the table above
d) Apply the cut-off you selected based on the training data to the validation data. What is the total profit on the validation data? $51,330
e) Provide a figure that shows the cumulative NPV as a function of the cut-off for both the training data and the validation data.
4. Comparison with linear regression
a) Repeat our model development from our first live session (note you need to repeat the steps as we now have a new data split). Rerun a variable selection model to find a "good model" using the updated data.
Include the model (the variables and the corresponding regression coefficients) as an Exhibit.
For my linear regression model I selected to run a stepwise selection with the default
parameters. Note that this is not the only "correct" model, a careful analysis would have included both backwards, forwards and stepwise variable selection and the comparison of a couple of candidate models, before selecting one based on their performance (and you can define performance in multiple ways as we have discussed). Hopefully your model’s performance exceeds the performance of the model discussed here!
|
Predictor |
Estimate |
P-Value |
|
Intercept |
594.6747 |
<0.001 |
|
RENT |
-288.596 |
0.010 |
|
INSTALL_RATE |
-152.409 |
<0.001 |
|
AMOUNT_REQUESTED |
-0.17395 |
<0.001 |
|
CHK_ACCT_1 |
327.4346 |
0.005 |
|
CHK_ACCT_2 |
396.0738 |
0.033 |
|
CHK_ACCT_3 |
563.3212 |
<0.001 |
|
SAV_ACCT_4 |
378.5465 |
0.001 |
|
TYPE_2 |
525.5611 |
<0.001 |
|
TYPE_5 |
-489.718 |
0.011 |
|
TYPE_6 |
-500.304 |
0.001 |
b) Create a data table that summarizes the total profit as a function of the NPV cut-off for extending credit on the training data (note that now your cut-off is in $ you will need to investigate what is a good cut-off, for example -$50 or $50, or something else). Select the best cut-off.
Include the table as an Exhibit.
Please refer to the Excel file for detailed information, the table below shows some highlights.
|
|
total training |
NPV validation |
|
-750 |
74541 |
16890 |
|
-700 |
75131 |
19502 |
|
-650 |
73680 |
24582 |
|
-600 |
71729 |
27223 |
|
-550 |
79911 |
28620 |
|
-500 |
77482 |
30658 |
|
-450 |
80913 |
30622 |
|
-400 |
89518 |
33657 |
|
-350 |
86859 |
41614 |
|
-300 |
78862 |
41014 |
|
-250 |
84718 |
41885 |
|
-200 |
85475 |
44566 |
|
-150 |
81744 |
39188 |
|
-100 |
80865 |
36418 |
|
-50 |
75280 |
31917 |
|
0 |
71269 |
31705 |
|
50 |
64194 |
29323 |
|
100 |
62314 |
27407 |
|
150 |
59080 |
19587 |
|
200 |
52279 |
19844 |
|
250 |
50432 |
19938 |
|
300 |
46356 |
20864 |
|
350 |
40465 |
16292 |
|
400 |
33020 |
18002 |
|
450 |
25780 |
16046 |
|
500 |
22455 |
13862 |
|
550 |
16736 |
9610 |
|
600 |
12651 |
8478 |
|
650 |
9394 |
6152 |
|
700 |
9235 |
6167 |
|
750 |
8759 |
5329 |
c) What is your selected cut-off? -$400
d) Create the same table for the validation data and include it as an Exhibit.
Please refer to the table above
e) Apply the cut-off you found to the validation data. What is the total profit on the validation data? $33,657
f) Provide a figure that shows the cumulative NPV as a function of the cut-off for both the training data and the validation data.
5. Model comparison
a) Compare the performance of the logistic regression model and the linear regression model. How does the total profit compare for the two models? Which model would you select as the foundation of a decision support system and why?
When we compare the performance as measured by the total NPV is higher for both the training and the validation sample, as a result I would select my Logistic Regression model over the Linear Regression model (although it may be worth the effort to try to improve on the linear regression model).