Classification Trees and k-NN

selen7
IAssignment2_Solutions1.docx

Week 2 Individual Assignment 2: Quantitative Analysis of Credit - Solutions

This assignment is based on the data we used during our two live sessions, but it has been updated to include a splitting variable ( credit2.xlsx ). In the spreadsheet under the tab “Data," you will find data

pertaining to 1,000 personal loan accounts. The tab “Data Dictionary” contains a description of what the various variables mean.

As a part of a new credit application, the company collects information about the applicant. The company then decides an amount of the credit extended (the variable CREDIT_EXTENDED). For these 1,000 accounts, we also have information on how profitable each account turned out to be (the variable NPV). A negative value indicates a net loss, and this typically happens when the debtor defaults on his/her payments.

The goal in this assignment is to investigate how one can use this data to better manage the bank's credit extension program. Specifically, our goal is to develop a classification model to classify a new credit account as “profitable” or “not profitable." Secondly we want to compare its performance in the context of decision support to a linear regression model that predicts NPV directly.

Please answer all the questions. Supply supporting documentation and show calculations as

needed. Please submit a single, well-formatted PDF or Word file. The instructor should not need to go searching for your answers! In addition, please upload an Excel file with your model outputs – the file will not be graded, but will help the instructor give you feedback, if your model differs substantially from the solutions.

For extra assistance, you may want to access the tutorials located on the course resource center page.

Data Preparation

The data preparation repeats the steps from the live session:

a) The goal is to predict whether or not a new credit will result in a profitable account. Create a new variable to use as the dependent variable.

b) Create dummy variables for all categorical variables with more than 2 values (or if you prefer, you can sort your variables into numerical and categorical when you run the model).

c) Split the data into 2 parts using the splitting variable that has been added to the data set. This is to ensure a more balanced split between the validation and training samples. Note that Analytic Solver Data Mining only allows 50 columns in the analysis, so leave out your base dummies (if you created them) when partitioning. After the data partition, you should have 666 rows in your training data and 334 in your validation data.

The Assignment

1. Applying Logistic Regression

If one fits a Logistic Regression Model using all the independent variables, one observes a) a gap in the classification performance between the training data and the validation data, and b) very

high p-values for some of the variables. The performance gap between the training and validation may be a sign of overfitting, and the high p-values may be a sign of “useless” variables in the model, or of multicollinearity.

a) Our goal is to classify credit requests into “profitable” and “not profitable." To that end, select to run “forward selection," and set FIN down to 1.5 (this lowers the threshold for a variable to enter the model, resulting in more models to choose from). Select one of the forward selection models based on the principles discussed in the book and/or the tutorials on the course resource center and run it.

Note: Exclude Credit Extended and any other variables not appropriate for the analysis.

Include the model (the variables and the corresponding regression coefficients) as an Exhibit.

Predictor

Estimate

Intercept

-0.1409

AGE

0.0350

NUM_CREDITS

-0.3472

DURATION

-0.0208

INSTALL_RATE

-0.4070

GUARANTOR

0.8746

OTHER_INSTALL

-0.6841

OWN_RES

0.5299

REAL_ESTATE

0.4792

AMOUNT_REQUESTED

-0.0001

GENDER_F

0.3894

CHK_ACCT_1

0.7863

CHK_ACCT_2

1.3594

CHK_ACCT_3

2.1811

SAV_ACCT_4

0.8059

HISTORY_4

0.6811

PRESENT_RESIDENT_2

-0.4176

EMPLOYMENT_2

0.3505

EMPLOYMENT_3

0.7936

TYPE_2

1.9168

TYPE_3

0.5290

TYPE_4

0.6752

Please refer to the Excel solutions for additional details.

b) Why did you select this particular model?

From the feature selection output we chose to run the model with 18 coefficients. This model was chosen because it has Cp close to the number of coefficients in the model (and not higher), it has probability above .05 and the improvement in RSS if we expand the model further is relatively small.

c) Based on your model, and setting the cut-off value to 0.5, please provide the following information (based on the validation data):

· The sensitivity of the model: 0.88

· The specificity of the model: 0.495

In other words, at the default cut-off we correctly identify 88% of the profitable customers, but include around 50% of the unprofitable customers.

2. ROC Curves

a) We now want to compare the predictive performance of the model on the training sample and on the validation sample. Create a single figure that compares the ROC curves for both the training sample and the validation sample. Please refer to the ROC tutorials in the resource center as needed for a step-by-step guide for creating an ROC curve. Alternatively, you can combine the two curves that Analytical Solver Data Mining provides into a single plot.

Include a clean figure as an Exhibit.

3. Finding the "best" cut-off

a) Create a data-table to calculate the total NPV (assuming we extend credit to all classified as

“profitable” as a function of the cut-off based on the training data. Select the best cut- off. Include the table as an Exhibit.

cut-off

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

NPV training

-56740

-26705

-2644

18771

45734

68023

88312

90200

84550

63883

0

NPV validation

-39141

-15636

-6309

13945

31128

36599

51636

51623

43999

32005

0

Please refer to the solutions for a more detailed table.

b) What is your selected cut-off? 0.725 (based on a more detailed table in the Excel file)

c) Create the same table for the validation data. Include the table as an Exhibit.

Refer to the table above

d) Apply the cut-off you selected based on the training data to the validation data. What is the total profit on the validation data? $51,330

e) Provide a figure that shows the cumulative NPV as a function of the cut-off for both the training data and the validation data.

4. Comparison with linear regression

a) Repeat our model development from our first live session (note you need to repeat the steps as we now have a new data split). Rerun a variable selection model to find a "good model" using the updated data.

Include the model (the variables and the corresponding regression coefficients) as an Exhibit.

For my linear regression model I selected to run a stepwise selection with the default

parameters. Note that this is not the only "correct" model, a careful analysis would have included both backwards, forwards and stepwise variable selection and the comparison of a couple of candidate models, before selecting one based on their performance (and you can define performance in multiple ways as we have discussed). Hopefully your model’s performance exceeds the performance of the model discussed here!

Predictor

Estimate

P-Value

Intercept

594.6747

<0.001

RENT

-288.596

0.010

INSTALL_RATE

-152.409

<0.001

AMOUNT_REQUESTED

-0.17395

<0.001

CHK_ACCT_1

327.4346

0.005

CHK_ACCT_2

396.0738

0.033

CHK_ACCT_3

563.3212

<0.001

SAV_ACCT_4

378.5465

0.001

TYPE_2

525.5611

<0.001

TYPE_5

-489.718

0.011

TYPE_6

-500.304

0.001

b) Create a data table that summarizes the total profit as a function of the NPV cut-off for extending credit on the training data (note that now your cut-off is in $ you will need to investigate what is a good cut-off, for example -$50 or $50, or something else). Select the best cut-off.

Include the table as an Exhibit.

Please refer to the Excel file for detailed information, the table below shows some highlights.

total

training

NPV

validation

-750

74541

16890

-700

75131

19502

-650

73680

24582

-600

71729

27223

-550

79911

28620

-500

77482

30658

-450

80913

30622

-400

89518

33657

-350

86859

41614

-300

78862

41014

-250

84718

41885

-200

85475

44566

-150

81744

39188

-100

80865

36418

-50

75280

31917

0

71269

31705

50

64194

29323

100

62314

27407

150

59080

19587

200

52279

19844

250

50432

19938

300

46356

20864

350

40465

16292

400

33020

18002

450

25780

16046

500

22455

13862

550

16736

9610

600

12651

8478

650

9394

6152

700

9235

6167

750

8759

5329

c) What is your selected cut-off? -$400

d) Create the same table for the validation data and include it as an Exhibit.

Please refer to the table above

e) Apply the cut-off you found to the validation data. What is the total profit on the validation data? $33,657

f) Provide a figure that shows the cumulative NPV as a function of the cut-off for both the training data and the validation data.

5. Model comparison

a) Compare the performance of the logistic regression model and the linear regression model. How does the total profit compare for the two models? Which model would you select as the foundation of a decision support system and why?

When we compare the performance as measured by the total NPV is higher for both the training and the validation sample, as a result I would select my Logistic Regression model over the Linear Regression model (although it may be worth the effort to try to improve on the linear regression model).