Data mining Xlminer

GERMAN CREDIT

GermanCredit.xls is the dataset for this case study.

Background

Money-lending has been around since the advent of money; it is perhaps the world’s second-oldest profession. The systematic evaluation of credit risk, though, is a relatively recent arrival; for most of history, lending was based largely on reputation and very incomplete data. Thomas Jefferson, the third President of the United States, was in debt throughout his life and unreliable in his debt payments, yet people continued to lend him money. It wasn’t until the beginning of the 20th century that the Retail Credit Company was founded to share information about credit. That company is now Equifax, one of the big three credit-scoring agencies (the other two are TransUnion and Experian).

Individual and local human judgment is now largely irrelevant to the credit-reporting process. Credit agencies and other big financial institutions extending credit at the retail level collect huge amounts of data to predict whether defaults or other adverse events will occur, based on extensive customer and transaction information.

Data

This case deals with an early stage of the historical transition to predictive modeling, in which humans were employed to label records as either good or poor credit. The German Credit dataset has 30 variables and 1000 records, each record being a prior applicant for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases). Figure 21.1 shows the values of these variables for the first four records. All the variables are explained in Table 21.6. New applicants for credit can also be evaluated on these 30 predictor variables and classified as a good or a bad credit risk.

The consequences of misclassification have been assessed as follows: The cost of a false positive (incorrectly saying that an applicant is a good credit risk) outweighs the benefit of a true positive (correctly saying that an applicant is a good credit risk) by a factor of 5. This is summarized in Table 21.4. The opportunity cost table was derived from the average net profit per loan as shown in Table 21.5.

Because decision makers are used to thinking of their decision in terms of net profits, we use these tables in assessing the performance of the various models.
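The 5-to-1 ratio above implies a break-even probability for extending credit. The following is a minimal sketch, not the figures from Tables 21.4 and 21.5 (which are not reproduced here): it assumes illustrative payoffs of +1 unit for a loan to a good risk and -5 units for a loan to a bad risk, and further assumes that rejecting an applicant earns nothing. The names GAIN_GOOD_LOAN, LOSS_BAD_LOAN, and expected_profit are hypothetical.

```python
from fractions import Fraction

# Assumed, illustrative payoffs: only the 5:1 ratio comes from the case text.
GAIN_GOOD_LOAN = 1   # profit when credit is extended to a good risk
LOSS_BAD_LOAN = 5    # loss when credit is extended to a bad risk


def expected_profit(p_good):
    """Expected net profit of extending credit to an applicant who is a
    good risk with probability p_good (rejecting is assumed to earn 0)."""
    return p_good * GAIN_GOOD_LOAN - (1 - p_good) * LOSS_BAD_LOAN


# Break-even: p * gain = (1 - p) * loss  =>  p = loss / (gain + loss)
break_even = Fraction(LOSS_BAD_LOAN, GAIN_GOOD_LOAN + LOSS_BAD_LOAN)
print(break_even)  # 5/6
```

Under these assumptions, extending credit has positive expected profit only when the probability of good credit exceeds 5/6, which is well above a naive 0.5 cutoff.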

Assignment

1. Review the predictor variables and guess what their role in a credit decision might be. Are there any surprises in the data?

2. Divide the data into training and validation partitions, and develop classification models using the following data mining techniques in XLMiner: logistic regression and classification trees. Use variants of these techniques as you see fit (e.g., best subset or best pruned tree). (Note: Depending on the capacity of your computer, Best Subsets may not run quickly, and possibly not at all; if this is the case, choose another selection technique.)

3. Choose one model from each technique and report the confusion matrix and the cost/gain matrix for the validation data. Which technique yields the highest net profit?

4. Let us try to improve our performance. Rather than accept XLMiner's initial classification of all applicants’ credit status, use the “predicted probability of success” in logistic regression (where success means 1) as a basis for selecting the best credit risks first, followed by poorer risk applicants.

· Sort the validation data on “predicted probability of success.”

· For each case, calculate the net profit of extending credit.

· Add another column for cumulative net profit.

a. How far into the validation data do you go to get maximum net profit? (Often, this is specified as a percentile or rounded to deciles.)

b. If this logistic regression model is used to score future applicants, what “probability of success” cutoff should be used in extending credit?
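The mechanics of steps 3 and 4 can be sketched outside XLMiner. The example below is an illustration only: the eight validation rows are made up, the payoffs (+1 for a good loan, -5 for a bad loan, 0 for a rejection) are assumed values that preserve the 5-to-1 ratio from the case, and the 0.5 cutoff used for the initial classification is an assumption about the default setting. It first tabulates the confusion matrix and net profit at that cutoff, then ranks the cases by predicted probability of success and accumulates net profit to find the best stopping point.

```python
from collections import Counter
from itertools import accumulate

# Assumed, illustrative payoffs: only the 5:1 ratio comes from the case text.
GAIN_GOOD_LOAN = 1    # profit when credit is extended to a good risk
LOSS_BAD_LOAN = -5    # loss when credit is extended to a bad risk

# Made-up validation rows: (predicted probability of success, actual class),
# where 1 = good credit and 0 = bad credit.
validation = [(0.95, 1), (0.40, 0), (0.85, 1), (0.90, 1),
              (0.70, 0), (0.80, 1), (0.60, 1), (0.30, 0)]

# Step 3: classify at an assumed 0.5 cutoff and count (actual, predicted) pairs.
confusion = Counter((actual, int(p >= 0.5)) for p, actual in validation)
default_profit = (confusion[(1, 1)] * GAIN_GOOD_LOAN
                  + confusion[(0, 1)] * LOSS_BAD_LOAN)
print(default_profit)  # 0

# Step 4: sort by predicted probability (best risks first), compute the net
# profit of extending credit to each case, then the running total.
ranked = sorted(validation, key=lambda row: row[0], reverse=True)
profits = [GAIN_GOOD_LOAN if actual == 1 else LOSS_BAD_LOAN
           for _, actual in ranked]
cumulative = list(accumulate(profits))

# 4a: position of the maximum cumulative net profit, as a fraction of the data.
best = max(range(len(cumulative)), key=cumulative.__getitem__)
depth = (best + 1) / len(ranked)

# 4b: the probability of the last case accepted serves as the cutoff.
cutoff = ranked[best][0]
print(cumulative[best], cutoff, depth)  # 4 0.8 0.5
```

In this toy data, lending to everyone classified as good at the 0.5 cutoff nets 0, while stopping halfway down the ranked list (probability of success at least 0.8) nets 4. The real validation partition will give different numbers, but the procedure is the same.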