Classification Trees and k-NN

Classification Trees and k-NN applied to Bank Credit

This assignment concludes the analysis of the credit data, exploring whether we can improve on our earlier analysis that used linear and logistic regression. Please refer to the earlier assignments for the data description and, if needed, repeat the data preparation steps using the credit2.xlsx data:

In the spreadsheet under the tab “Data,” you will find data pertaining to 1,000 personal loan accounts. The tab “Data Dictionary” contains a description of what the various variables mean.

As part of a new credit application, the company collects information about the applicant. The company then decides the amount of credit to extend (the variable CREDIT_EXTENDED). For these 1,000 accounts, we also have information on how profitable each account turned out to be (the variable NPV). A negative value indicates a net loss, which typically happens when the debtor defaults on his/her payments.

1. Create a categorical variable that indicates whether or not a new credit extension will result in a positive NPV.

2. Create dummy variables for all categorical variables with more than two values (if appropriate).

3. Split the data into two parts using the splitting variable that is part of the data set (see the note below). This is to ensure a more balanced split between the validation and training samples. After the data partition you should have 666 rows in your training data and 334 in your validation data.

Note: If you run into the issue that your number of columns exceeds 50, you may leave out the employment variable.
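The three preparation steps above can be sketched in a few lines of Python. This is only an illustration on made-up rows, not the assignment's required tool or data: NPV is the only column name taken from the assignment, while PURPOSE, SPLIT, PROFITABLE and the values "train"/"valid" are hypothetical placeholders for whatever the Data Dictionary actually specifies.

```python
# Toy rows standing in for credit2.xlsx. Only NPV is a real column name;
# PURPOSE and SPLIT are hypothetical placeholders.
rows = [
    {"NPV": 120.0, "PURPOSE": "car", "SPLIT": "train"},
    {"NPV": -300.0, "PURPOSE": "education", "SPLIT": "valid"},
    {"NPV": 45.5, "PURPOSE": "furniture", "SPLIT": "train"},
    {"NPV": -10.0, "PURPOSE": "car", "SPLIT": "train"},
]


def prepare(rows, cat_col="PURPOSE"):
    """Add a PROFITABLE flag and one dummy column per category level."""
    levels = sorted({r[cat_col] for r in rows})
    prepared = []
    for r in rows:
        out = {k: v for k, v in r.items() if k != cat_col}
        # Step 1: 1 if the account had a positive NPV, otherwise 0.
        out["PROFITABLE"] = 1 if r["NPV"] > 0 else 0
        # Step 2: one 0/1 dummy per level (no base level dropped here).
        for level in levels:
            out[f"{cat_col}_{level}"] = 1 if r[cat_col] == level else 0
        prepared.append(out)
    return prepared


prepared = prepare(rows)

# Step 3: partition on the splitting variable rather than at random.
training = [r for r in prepared if r["SPLIT"] == "train"]
validation = [r for r in prepared if r["SPLIT"] == "valid"]
print(len(training), len(validation))
```

On the real data the same partition should yield 666 training rows and 334 validation rows.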

Please answer all questions. Supply supporting documentation and show calculations as needed. Please submit a single well-formatted Word file. In addition, please upload an Excel file with your model outputs.

Classification trees

Classify customers as profitable/not profitable with a classification tree

1. Run the Classification Tree algorithm using all the relevant independent variables (excluding, as before, Credit Extended, Obs#, etc.), including all the dummy variables (recall that one does not exclude base values when running classification trees), with profitable/not profitable as the output variable. Use the validation data to prune back the tree, and select the best pruned tree for scoring.

a. Include the classification confusion matrix for the validation sample and a figure of the best pruned tree as Exhibits.  

2. Analyze the output.

a. How many decision nodes are in the best pruned tree?

b. What is the error rate for i) the training data and ii) the validation data in the best pruned tree?

c. What explains the difference in the error rate?

d. Which applicants for credit will be rejected by the model (using the best pruned tree)? (Describe the type of customer in plain English.)

3. Using the model for decision making.

a. Consider a 27-year-old domestic student who has $100 in her checking account but no savings account. The student has one existing credit, which has so far been paid back duly. The credit duration is 12 months. The applicant has been renting her current place for less than 12 months, does not own any real estate, and has just started graduate school (the present employment variable is set to 1 and nature of job to 2). The applicant has no dependents and no guarantor. The applicant wants to buy a used car and has requested $4,500 in credit, and therefore the installment rate is quite high, or 2.25%. However, the applicant does not have other installment plan credits. Finally, the applicant has a phone in her name.

How would the best pruned tree classify the student?
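As a reminder of how the error rates asked for in question 2b relate to a confusion matrix, here is a minimal Python sketch. The actual and predicted labels are made up for illustration (1 = profitable, 0 = not profitable) and do not come from the credit data; the function names are hypothetical.

```python
def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a 0/1 classification."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


def error_rate(counts):
    """Share of records whose predicted class differs from the actual class."""
    total = sum(counts.values())
    wrong = counts[(0, 1)] + counts[(1, 0)]
    return wrong / total


# Made-up labels for eight accounts.
actual = [1, 1, 0, 0, 1, 0, 1, 1]
predicted = [1, 0, 0, 1, 1, 0, 1, 1]

cm = confusion_matrix(actual, predicted)
print(cm)
print(error_rate(cm))  # 2 misclassified out of 8 -> 0.25
```

Running the same calculation separately on the training and validation partitions gives the two error rates to compare.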

k-NN

Classify customers as profitable/not profitable with k-NN

4. Run the k-NN algorithm for classification, testing all values of k from 1 to 10, selecting to score the data on the best k (remember to standardize/normalize the data). Request detailed output for both the training and validation data.

a. Using the search log, plot the %Error of the validation sample. Include the plot in your assignment.

b. What is the best value of k?

c. Briefly explain why the % Error is zero for the training sample when k=1, but not for the validation sample.

5. Analyze the output.

a. What are some of the main differences between the customers identified as most likely to be profitable and the customers identified as least likely to be profitable? Briefly discuss.
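For readers who want to see what the search over k in question 4 does mechanically, the following is a rough Python sketch using only the standard library. The two features (age and amount requested), the labels, and all function names are invented for illustration. It assumes Euclidean distance, z-score standardization based on the training sample, and majority voting, which may differ in detail from the software used in the course.

```python
import math
from collections import Counter


def standardize(train_X, other_X):
    """Z-score both sets using the training means and standard deviations."""
    cols = list(zip(*train_X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]

    def scale(X):
        return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

    return scale(train_X), scale(other_X)


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    # On a tied vote, the label that appeared first among the neighbours wins.
    return votes.most_common(1)[0][0]


def pct_error(train_X, train_y, X, y, k):
    """Percentage of rows in (X, y) misclassified by k-NN."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != label
                for x, label in zip(X, y))
    return 100.0 * wrong / len(y)


# Made-up data: [age, amount requested]; 1 = profitable, 0 = not profitable.
train_X = [[22, 1500], [25, 4000], [31, 2500], [35, 8000], [41, 1200], [45, 6000],
           [52, 3000], [58, 9000], [29, 7000], [38, 2000], [48, 1000], [60, 5000]]
train_y = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
valid_X = [[27, 1800], [33, 7500], [50, 2200], [55, 6500]]
valid_y = [1, 0, 1, 0]

train_Z, valid_Z = standardize(train_X, valid_X)

# Search log: validation %Error for each k from 1 to 10.
log = {k: pct_error(train_Z, train_y, valid_Z, valid_y, k) for k in range(1, 11)}
best_k = min(log, key=log.get)  # smallest k among ties

train_err_k1 = pct_error(train_Z, train_y, train_Z, train_y, 1)
print(log)
print(best_k, train_err_k1)
```

Without standardization the amount column, measured in thousands, would dominate the distance and age would barely matter, which is why the assignment reminds you to standardize/normalize.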

Method comparisons

You have now run three different classification algorithms on this data: logistic regression, classification tree and k-NN. Compare their performance in two ways: first using statistical measures, and second using their possible impact on the credit extension process. Feel free to take advantage of the solutions to Individual Assignment 2 as a starting point.

Hint: Below is a potential set-up to measure the business impact. First, collect the predicted probability of being profitable for both the training and validation data as well as the true NPV into a single spreadsheet. Perhaps similar to this:

Then select a cell for a cut-off (in my case I used E1). Then, for each method and each sample, we can calculate the cumulative profit for a specific cut-off using the SUMIFS() function in Excel. Specifically, the following formula sums up the NPV of all credit extensions that are made using the training sample and logistic regression:

You then need to extend this approach to both data samples and all three methods. Perhaps similarly to this:

You can then create data tables to investigate the best cut-off for each method and the corresponding NPV on the validation data.
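The cut-off analysis described in the hint can also be expressed outside Excel. The Python sketch below mimics a conditional sum plus a one-variable data table on made-up numbers. It assumes credit is extended when the predicted probability is greater than or equal to the cut-off, which is an assumption about the comparison used in the spreadsheet, and the probabilities, NPVs and function name are hypothetical.

```python
def cumulative_profit(probabilities, npvs, cutoff):
    """Sum NPV over accounts whose predicted probability meets the cut-off."""
    # Assumption: extend credit when probability >= cutoff.
    return sum(npv for p, npv in zip(probabilities, npvs) if p >= cutoff)


# Made-up predicted probabilities of being profitable and realised NPVs
# for one method on one sample.
prob = [0.92, 0.35, 0.78, 0.51, 0.15, 0.66]
npv = [250.0, -400.0, 120.0, -80.0, -600.0, 90.0]

# Equivalent of a data table: total NPV for cut-offs 0.0, 0.1, ..., 1.0.
cutoffs = [i / 10 for i in range(0, 11)]
table = {c: cumulative_profit(prob, npv, c) for c in cutoffs}

best_cutoff = max(table, key=table.get)
print(best_cutoff, table[best_cutoff])  # 0.6 460.0
```

Repeating this for each method and for both samples gives the comparison grid; to avoid overstating the result, one option is to choose the cut-off on the training sample and report the corresponding NPV on the validation sample.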