Classification Trees and k-NN

IAssignment2_Solutions1.docx

Home >Mathematics homework help >Statistics homework help >Classification Trees and k-NN

Week 2 Individual Assignment 2: Quantitative Analysis of Credit - Solutions

This assignment is based on the data we used during our two live sessions, but it has been updated to include a splitting variable ( credit2.xlsx ). In the spreadsheet under the tab “Data," you will find data

pertaining to 1,000 personal loan accounts. The tab “Data Dictionary” contains a description of what the various variables mean.

As a part of a new credit application, the company collects information about the applicant. The company then decides an amount of the credit extended (the variable CREDIT_EXTENDED). For these 1,000 accounts, we also have information on how profitable each account turned out to be (the variable NPV). A negative value indicates a net loss, and this typically happens when the debtor defaults on his/her payments.

The goal in this assignment is to investigate how one can use this data to better manage the bank's credit extension program. Specifically, our goal is to develop a classification model to classify a new credit account as “profitable” or “not profitable." Secondly we want to compare its performance in the context of decision support to a linear regression model that predicts NPV directly.

Please answer all the questions. Supply supporting documentation and show calculations as

needed. Please submit a single, well-formatted PDF or Word file. The instructor should not need to go searching for your answers! In addition, please upload an Excel file with your model outputs – the file will not be graded, but will help the instructor give you feedback, if your model differs substantially from the solutions.

For extra assistance, you may want to access the tutorials located on the course resource center page.

Data Preparation

The data preparation repeats the steps from the live session:

a) The goal is to predict whether or not a new credit will result in a profitable account. Create a new variable to use as the dependent variable.

b) Create dummy variables for all categorical variables with more than 2 values (or if you prefer, you can sort your variables into numerical and categorical when you run the model).

c) Split the data into 2 parts using the splitting variable that has been added to the data set. This is to ensure a more balanced split between the validation and training samples. Note that Analytic Solver Data Mining only allows 50 columns in the analysis, so leave out your base dummies (if you created them) when partitioning. After the data partition, you should have 666 rows in your training data and 334 in your validation data.

The Assignment

1. Applying Logistic Regression

If one fits a Logistic Regression Model using all the independent variables, one observes a) a gap in the classification performance between the training data and the validation data, and b) very

high p-values for some of the variables. The performance gap between the training and validation may be a sign of overfitting, and the high p-values may be a sign of “useless” variables in the model, or of multicollinearity.

a) Our goal is to classify credit requests into “profitable” and “not profitable." To that end, select to run “forward selection," and set FIN down to 1.5 (this lowers the threshold for a variable to enter the model, resulting in more models to choose from). Select one of the forward selection models based on the principles discussed in the book and/or the tutorials on the course resource center and run it.

Note: Exclude Credit Extended and any other variables not appropriate for the analysis.

Include the model (the variables and the corresponding regression coefficients) as an Exhibit.

Predictor	Estimate
Intercept	-0.1409
AGE	0.0350
NUM_CREDITS	-0.3472
DURATION	-0.0208
INSTALL_RATE	-0.4070
GUARANTOR	0.8746
OTHER_INSTALL	-0.6841
OWN_RES	0.5299
REAL_ESTATE	0.4792
AMOUNT_REQUESTED	-0.0001
GENDER_F	0.3894
CHK_ACCT_1	0.7863
CHK_ACCT_2	1.3594
CHK_ACCT_3	2.1811
SAV_ACCT_4	0.8059
HISTORY_4	0.6811
PRESENT_RESIDENT_2	-0.4176
EMPLOYMENT_2	0.3505
EMPLOYMENT_3	0.7936
TYPE_2	1.9168
TYPE_3	0.5290
TYPE_4	0.6752

Please refer to the Excel solutions for additional details.

b) Why did you select this particular model?

From the feature selection output we chose to run the model with 18 coefficients. This model was chosen because it has Cp close to the number of coefficients in the model (and not higher), it has probability above .05 and the improvement in RSS if we expand the model further is relatively small.

c) Based on your model, and setting the cut-off value to 0.5, please provide the following information (based on the validation data):

· The sensitivity of the model: 0.88

· The specificity of the model: 0.495

In other words, at the default cut-off we correctly identify 88% of the profitable customers, but include around 50% of the unprofitable customers.

2. ROC Curves

a) We now want to compare the predictive performance of the model on the training sample and on the validation sample. Create a single figure that compares the ROC curves for both the training sample and the validation sample. Please refer to the ROC tutorials in the resource center as needed for a step-by-step guide for creating an ROC curve. Alternatively, you can combine the two curves that Analytical Solver Data Mining provides into a single plot.

Include a clean figure as an Exhibit.

3. Finding the "best" cut-off

a) Create a data-table to calculate the total NPV (assuming we extend credit to all classified as

“profitable” as a function of the cut-off based on the training data. Select the best cut- off. Include the table as an Exhibit.

cut-off	0	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	1
NPV training	-56740	-26705	-2644	18771	45734	68023	88312	90200	84550	63883	0
NPV validation	-39141	-15636	-6309	13945	31128	36599	51636	51623	43999	32005	0

Please refer to the solutions for a more detailed table.

b) What is your selected cut-off? 0.725 (based on a more detailed table in the Excel file)

c) Create the same table for the validation data. Include the table as an Exhibit.

Refer to the table above

d) Apply the cut-off you selected based on the training data to the validation data. What is the total profit on the validation data? $51,330

e) Provide a figure that shows the cumulative NPV as a function of the cut-off for both the training data and the validation data.

4. Comparison with linear regression

a) Repeat our model development from our first live session (note you need to repeat the steps as we now have a new data split). Rerun a variable selection model to find a "good model" using the updated data.

Include the model (the variables and the corresponding regression coefficients) as an Exhibit.

For my linear regression model I selected to run a stepwise selection with the default

parameters. Note that this is not the only "correct" model, a careful analysis would have included both backwards, forwards and stepwise variable selection and the comparison of a couple of candidate models, before selecting one based on their performance (and you can define performance in multiple ways as we have discussed). Hopefully your model’s performance exceeds the performance of the model discussed here!

Predictor	Estimate	P-Value
Intercept	594.6747	<0.001
RENT	-288.596	0.010
INSTALL_RATE	-152.409	<0.001
AMOUNT_REQUESTED	-0.17395	<0.001
CHK_ACCT_1	327.4346	0.005
CHK_ACCT_2	396.0738	0.033
CHK_ACCT_3	563.3212	<0.001
SAV_ACCT_4	378.5465	0.001
TYPE_2	525.5611	<0.001
TYPE_5	-489.718	0.011
TYPE_6	-500.304	0.001

b) Create a data table that summarizes the total profit as a function of the NPV cut-off for extending credit on the training data (note that now your cut-off is in $ you will need to investigate what is a good cut-off, for example -$50 or $50, or something else). Select the best cut-off.

Include the table as an Exhibit.

Please refer to the Excel file for detailed information, the table below shows some highlights.

	total training	NPV validation
-750	74541	16890
-700	75131	19502
-650	73680	24582
-600	71729	27223
-550	79911	28620
-500	77482	30658
-450	80913	30622
-400	89518	33657
-350	86859	41614
-300	78862	41014
-250	84718	41885
-200	85475	44566
-150	81744	39188
-100	80865	36418
-50	75280	31917
0	71269	31705
50	64194	29323
100	62314	27407
150	59080	19587
200	52279	19844
250	50432	19938
300	46356	20864
350	40465	16292
400	33020	18002
450	25780	16046
500	22455	13862
550	16736	9610
600	12651	8478
650	9394	6152
700	9235	6167
750	8759	5329

c) What is your selected cut-off? -$400

d) Create the same table for the validation data and include it as an Exhibit.

Please refer to the table above

e) Apply the cut-off you found to the validation data. What is the total profit on the validation data? $33,657

f) Provide a figure that shows the cumulative NPV as a function of the cut-off for both the training data and the validation data.

5. Model comparison

a) Compare the performance of the logistic regression model and the linear regression model. How does the total profit compare for the two models? Which model would you select as the foundation of a decision support system and why?

When we compare the performance as measured by the total NPV is higher for both the training and the validation sample, as a result I would select my Logistic Regression model over the Linear Regression model (although it may be worth the effort to try to improve on the linear regression model).