CHAPTER 8

Model Evaluation

This chapter formally introduces the most commonly used methods for testing the quality of a data science model. Throughout this book, various validation techniques have been used to split the available data into a training set and a testing set. In the implementation sections, different types of performance operators have been used in conjunction with validation without a detailed explanation of how these operators really function. Several ways of evaluating the performance of predictive data science models will now be discussed.

There are a few main tools available to test a classification model’s quality: confusion matrices (or truth tables), lift charts, ROC (receiver operator characteristic) curves, and the area under the curve (AUC). How these tools are constructed will be defined in detail, and how to implement performance evaluations will be described. To evaluate a numeric prediction from a regression model, there are many conventional statistical tests that may be applied (Black, 2008), a few of which were discussed in Chapter 5, Regression Methods.

DIRECT MARKETING

Direct marketing (DM) companies, which send out postal mail (or, in the days before do-not-call lists, called prospects), were among the early pioneers in applying data science techniques (Berry, 1999). A key performance indicator for their marketing activities is, of course, the improvement in their bottom line as a result of using predictive models.

Assume that a typical average response rate for a direct mail campaign is 10%. Further assume that cost per mail sent = $1 and potential revenue per response = $20.

If they have 10,000 people to send out their mailers to, then they can expect to receive potential revenues of 10,000 × 10% × $20 = $20,000, which would yield a net return of $10,000. Typically, the mailers are sent out in batches to spread costs over a period of time. Further assume that these are sent out in batches of 1000. The first question someone would ask is how to divide the list of names into these batches. If the average expectation of return is 10%, then would it not make a lot of sense to just send one batch of mails to those prospects that make up this 10% and be done with the campaign?

Clearly this would save a lot of time and money and the net return would jump to $19,000!

Can all of these 10 percenters be identified? While this is clearly unrealistic, classification techniques can be used to rank or score prospects by the likelihood that they would respond to the mailers. Predictive analytics is after all about converting future uncertainties into usable probabilities (Taylor, 2011). Then a predictive method can be used to order these probabilities and send out the mailers to only those who score above a particular threshold (say 85% chance of response).

Finally, some techniques may be better suited to this problem than others. How can the different available methods be compared based on their performance? Will logistic regression capture these top 10 percenters better than support vector machines? What are the different metrics that can be used to select the best performing methods? These are some of the things that will be discussed in this chapter in more detail.

https://doi.org/10.1016/B978-0-12-814761-0.00008-3

8.1 CONFUSION MATRIX

Classification performance is best described by an aptly named tool called the confusion matrix or truth table. Understanding the confusion matrix requires becoming familiar with several definitions. Before introducing them, consider a basic confusion matrix for a binary (binomial) classification, in which there are two classes (say, Y or N). The classification of a specific example can turn out in one of four possible ways:

• The predicted class is Y, and the actual class is also Y - this is a True Positive or TP

• The predicted class is Y, and the actual class is N - this is a False Positive or FP

• The predicted class is N, and the actual class is Y - this is a False Negative or FN

• The predicted class is N, and the actual class is also N - this is a True Negative or TN

A basic confusion matrix is traditionally arranged as a 2 × 2 matrix as shown in Table 8.1. The predicted classes are arranged horizontally in rows and the actual classes are arranged vertically in columns, although sometimes this order is reversed (Kohavi & Provost, 1998). A quick way to examine this matrix, or truth table as it is also called, is to scan the diagonal from top left to bottom right. An ideal classification performance would only have entries along this main diagonal and the off-diagonal elements would be zero.

Table 8.1 Confusion Matrix

|                                  | Actual class (observation): Y | Actual class (observation): N  |
|----------------------------------|-------------------------------|--------------------------------|
| Predicted class (expectation): Y | TP (correct result)           | FP (unexpected result)         |
| Predicted class (expectation): N | FN (missing result)           | TN (correct absence of result) |

TP, true positive; FP, false positive; FN, false negative; TN, true negative.

These four cases will now be used to introduce several commonly used terms for understanding and explaining classification performance. As mentioned earlier, a perfect classifier will have no entries for FP and FN (i.e., the number of FP = number of FN = 0).

1. Sensitivity is the ability of a classifier to select all the cases that need to be selected. A perfect classifier will select all the actual Y’s and will not miss any actual Y’s. In other words it will have no FNs. In reality, any classifier will miss some true Y’s, and thus, have some FNs. Sensitivity is expressed as a ratio (or percentage) calculated as follows: TP/(TP + FN). However, sensitivity alone is not sufficient to evaluate a classifier. In situations such as credit card fraud, where fraud rates are typically around 0.1%, an ordinary classifier that treats legitimate transactions as the positive class may be able to show a sensitivity of 99.9% simply by labeling nearly all the cases as legitimate (TP). The ability to detect illegitimate or fraudulent transactions, the TNs, is also needed. This is where the next measure, specificity, which ignores TPs, comes in.

2. Specificity is the ability of a classifier to reject all the cases that need to be rejected. A perfect classifier will reject all the actual N’s and will not deliver any unexpected results. In other words, it will have no FPs. In reality, any classifier will select some cases that need to be rejected, and thus, have some FPs. Specificity is expressed as a ratio (or percentage) calculated as: TN/(TN + FP).

3. Relevance is a term that is easy to understand in a document search and retrieval scenario. Suppose a search is run for a specific term and that search returns 100 documents. Of these, let us say only 70 were useful because they were relevant to the search. Furthermore, the search actually missed out on an additional 40 documents that could actually have been useful. With this context, additional terms can be defined.

4. Precision is defined as the proportion of cases found that were actually relevant. From the example, this number was 70, and thus, the precision is 70/100 or 70%. The 70 documents were TP, whereas the remaining 30 were FP. Therefore, precision is TP/(TP + FP).

5. Recall is defined as the proportion of the relevant cases that were actually found among all the relevant cases. Again, with the example, only 70 of the total 110 (70 found + 40 missed) relevant cases were actually found, thus giving a recall of 70/110 = 63.64%. It is evident that recall is the same as sensitivity, because recall is also given by TP/(TP + FN).

6. Accuracy is defined as the ability of the classifier to select all cases that need to be selected and reject all cases that need to be rejected. For a classifier with 100% accuracy, this would imply that FN = FP = 0. Note that in the document search example, the TN has not been indicated, as this could be really large. Accuracy is given by (TP + TN)/(TP + FP + TN + FN). Finally, error is simply the complement of accuracy, measured by (1 - accuracy).
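The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names are chosen here and are not part of any particular tool. The usage lines plug in the document search example (70 relevant documents found, 30 irrelevant ones returned, 40 relevant ones missed); the TN count is not given in that example, so only precision and recall are evaluated.

```python
def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN): share of the actual positives that were selected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """TN / (TN + FP): share of the actual negatives that were rejected."""
    return tn / (tn + fp)


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP): share of the selected cases that were relevant."""
    return tp / (tp + fp)


# Recall uses the same formula as sensitivity.
recall = sensitivity


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(TP + TN) / (TP + TN + FP + FN): share of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def error(tp: int, tn: int, fp: int, fn: int) -> float:
    """Complement of accuracy."""
    return 1 - accuracy(tp, tn, fp, fn)


# Document search example from the text.
tp, fp, fn = 70, 30, 40
print(f"precision = {precision(tp, fp):.2%}")  # 70.00%
print(f"recall    = {recall(tp, fn):.2%}")     # 63.64%
```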

Table 8.2 summarizes all the major definitions. Fortunately, the analyst does not need to memorize these equations because most analytics tools calculate them automatically. However, it is important to have a good fundamental understanding of these terms.

Table 8.2 Evaluation Measures

| Term        | Definition                                       | Calculation                   |
|-------------|--------------------------------------------------|-------------------------------|
| Sensitivity | Ability to select what needs to be selected      | TP/(TP + FN)                  |
| Specificity | Ability to reject what needs to be rejected      | TN/(TN + FP)                  |
| Precision   | Proportion of cases found that were relevant     | TP/(TP + FP)                  |
| Recall      | Proportion of all relevant cases that were found | TP/(TP + FN)                  |
| Accuracy    | Aggregate measure of classifier performance      | (TP + TN)/(TP + TN + FP + FN) |

TP, true positive; FP, false positive; FN, false negative; TN, true negative.

8.2 ROC AND AUC

Measures like accuracy or precision are essentially aggregates by nature, in the sense that they provide the average performance of the classifier on the dataset. A classifier can have a high accuracy on a dataset but have poor class recall and precision. Clearly, a model to detect fraud is no good if its ability to detect TPs for the fraud = yes class (and thereby its class recall) is low. It is, therefore, quite useful to look at measures that compare different metrics to see if there is a situation for a trade-off: for example, can a little overall accuracy be sacrificed to gain a lot more improvement in class recall? One can examine a model’s rate of detecting TPs and contrast it with its rate of producing FPs. The receiver operator characteristic (ROC) curves meet this need and were originally developed in the field of signal detection (Green, 1966). A ROC curve is created by plotting the fraction of TPs (TP rate) versus the fraction of FPs (FP rate). When a table of such values is generated, the FP rate can be plotted on the horizontal axis and the TP rate (same as sensitivity or recall) on the vertical axis. The FP rate can also be expressed as (1 - specificity), that is, 1 - TN rate.

Consider a classifier that could predict if a website visitor is likely to click on a banner ad: the model would be most likely built using historic click-through rates based on pages visited, time spent on certain pages, and other characteristics of site visitors. In order to evaluate the performance of this model on test data, a table such as the one shown in Table 8.3 can be generated.

The first column, “Actual Class,” consists of the actual class for a particular example (in this case, whether a website visitor clicked on the banner ad). The next column, “Predicted Class,” is the model prediction and the third column, “Confidence of response,” is the confidence of this prediction. In order to create a ROC chart, the predicted data will need to be sorted in decreasing order of confidence level, which has been done in this case. By comparing columns Actual class and Predicted class, the type of prediction can be identified: for instance, spreadsheet rows 2 through 5 are all TPs and row 6 is the first instance of a FP. As observed in columns “Number of TP” and “Number of FP,” one can keep a running count of the TPs and FPs and also calculate the fraction of TPs and FPs, which are shown in columns “Fraction of TP” and “Fraction of FP.”

Observing the “Number of TP” and “Number of FP” columns, it is evident that the model has discovered a total of 6 TPs and 4 FPs (the remaining 10 examples are all TNs). It can also be seen that the model has identified nearly 67% of all the TPs before it fails and hits its first FP (row 6 above). Finally, all TPs have been identified (when Fraction of TP = 1) before the next FP was run into. If Fraction of FP (FP rate) versus Fraction of TP (TP rate) were now to be plotted, then a ROC chart similar to the one shown in Fig. 8.1 would be seen. Clearly an ideal classifier would have an accuracy of 100% (and thus, would have identified 100% of all TPs). Thus, the ROC for an ideal classifier would look like the dashed line shown in Fig. 8.1. Finally, an ordinary or random classifier (which has only a 50% accuracy) would possibly be able to find one FP for every TP, and thus, look like the 45-degree line shown.
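The running counts described above can be reproduced with a short Python sketch. The class labels are transcribed from Table 8.3 in their sorted order; the variable names are illustrative. Note one assumption carried over from the table: the fraction of FP is the running FP count divided by the 4 FPs the model produced, whereas the more common definition of FP rate divides by all actual negatives (14 in this example).

```python
# Actual classes from Table 8.3, in decreasing order of confidence.
# 1 = actual "Response", 0 = actual "No response".
actual = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0] + [0] * 10
# The model predicts "Response" for the first 10 rows only.
predicted = [1] * 10 + [0] * 10

# Totals used as denominators, following the table's convention.
total_tp = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 1)
total_fp = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 0)

rows = []
num_tp = num_fp = 0
for a, p in zip(actual, predicted):
    if p == 1 and a == 1:
        num_tp += 1      # true positive: step up on the ROC chart
    elif p == 1 and a == 0:
        num_fp += 1      # false positive: step right on the ROC chart
    rows.append((num_tp, num_fp, num_fp / total_fp, num_tp / total_tp))

print("TP FP  frac_FP frac_TP")
for n_tp, n_fp, f_fp, f_tp in rows:
    print(f"{n_tp:>2} {n_fp:>2}  {f_fp:7.2f} {f_tp:7.3f}")
```

Plotting the third column against the fourth gives the stair-step curve for this example.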

As the number of test examples becomes larger, the ROC curve will become smoother: the random classifier will simply look like a straight line drawn between the points (0,0) and (1,1), because the stair steps become extremely small. The area under this random classifier’s ROC curve is basically the area of a right triangle (with base 1 and height 1), which is 0.5. This quantity is termed Area Under the Curve or AUC. AUC for the ideal classifier is 1.0. Thus, the performance of a classifier can also be quantified by its AUC: obviously any AUC higher than 0.5 is better than random and the closer it is to 1.0, the better the performance. A common rule of thumb is to select those classifiers that not only have a ROC curve that is closest to ideal, but also an AUC higher than 0.8. Typical uses for AUC and ROC curves are to compare the performance of different classification algorithms for the same dataset.
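As a rough check of these areas, the AUC of a curve given as a list of points can be computed with the trapezoidal rule. The sketch below is illustrative (the helper name `auc_trapezoid` is chosen here). The third curve uses the corner points read off the “Fraction of FP” and “Fraction of TP” columns of Table 8.3, so its area reflects that table's way of normalizing the FP count.

```python
def auc_trapezoid(points):
    """Area under a curve given as (x, y) points ordered by non-decreasing x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Random classifier: straight line from (0, 0) to (1, 1).
random_roc = [(0.0, 0.0), (1.0, 1.0)]
# Ideal classifier: straight up to TP rate 1, then across.
ideal_roc = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Corner points of the example curve, taken from Table 8.3.
example_roc = [(0.0, 0.0), (0.0, 4 / 6), (0.25, 4 / 6), (0.25, 1.0), (1.0, 1.0)]

print(f"random : {auc_trapezoid(random_roc):.3f}")   # 0.500
print(f"ideal  : {auc_trapezoid(ideal_roc):.3f}")    # 1.000
print(f"example: {auc_trapezoid(example_roc):.3f}")
```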

Table 8.3 Classifier Performance Data Needed for Building a ROC Curve

| Actual Class | Predicted Class | Confidence of “Response” | Type? | Number of TP | Number of FP | Fraction of FP | Fraction of TP |
|--------------|-----------------|--------------------------|-------|--------------|--------------|----------------|----------------|
| Response     | Response        | 0.902                    | TP    | 1            | 0            | 0              | 0.167          |
| Response     | Response        | 0.896                    | TP    | 2            | 0            | 0              | 0.333          |
| Response     | Response        | 0.834                    | TP    | 3            | 0            | 0              | 0.500          |
| Response     | Response        | 0.741                    | TP    | 4            | 0            | 0              | 0.667          |
| No response  | Response        | 0.686                    | FP    | 4            | 1            | 0.25           | 0.667          |
| Response     | Response        | 0.616                    | TP    | 5            | 1            | 0.25           | 0.833          |
| Response     | Response        | 0.609                    | TP    | 6            | 1            | 0.25           | 1              |
| No response  | Response        | 0.576                    | FP    | 6            | 2            | 0.5            | 1              |
| No response  | Response        | 0.542                    | FP    | 6            | 3            | 0.75           | 1              |
| No response  | Response        | 0.530                    | FP    | 6            | 4            | 1              | 1              |
| No response  | No response     | 0.440                    | TN    | 6            | 4            | 1              | 1              |
| No response  | No response     | 0.428                    | TN    | 6            | 4            | 1              | 1              |
| No response  | No response     | 0.393                    | TN    | 6            | 4            | 1              | 1              |
| No response  | No response     | 0.313                    | TN    | 6            | 4            | 1              | 1              |
| No response  | No response     | 0.298                    | TN    | 6            | 4            | 1              | 1              |
| No response  | No response     | 0.260                    | TN    | 6            | 4            | 1              | 1              |
| No response  | No response     | 0.248                    | TN    | 6            | 4            | 1              | 1              |
| No response  | No response     | 0.247                    | TN    | 6            | 4            | 1              | 1              |
| No response  | No response     | 0.241                    | TN    | 6            | 4            | 1              | 1              |
| No response  | No response     | 0.116                    | TN    | 6            | 4            | 1              | 1              |

ROC, receiver operator characteristic; TP, true positive; FP, false positive; TN, true negative.

8.3 LIFT CURVES

Lift curves or lift charts were first deployed in direct marketing where the problem was to identify if a particular prospect was worth calling or sending an advertisement by mail. It was mentioned in the use case at the beginning of this chapter that with a predictive model, one can score a list of prospects by their propensity to respond to an ad campaign. When the prospects are sorted by this score (by the decreasing order of their propensity to respond), one now ends up with a mechanism to systematically select the most valuable prospects right at the beginning, and thus, maximize the return. Thus, rather than mailing out the ads to a random group of prospects, the ads can now be sent to the first batch of “most likely responders,” followed by the next batch and so on.

Without classification, the “most likely responders” are distributed randomly throughout the dataset. Suppose there is a dataset of 200 prospects and it contains a total of 40 responders or TPs. If the dataset is broken up into, say, 10 equal sized batches (called deciles), the likelihood of finding TPs in each batch is also 20%, that is, on average four of the 20 samples in each decile will be TPs. However, when a predictive model is used to classify the prospects, a good model will tend to pull these “most likely responders” into the top few deciles. Thus, in this simple example, it might be found that the first two deciles will have all 40 TPs and the remaining eight deciles have none.

FIGURE 8.1 Comparing ROC curve for the example shown in Table 8.3 to random and ideal classifiers. ROC, receiver operator characteristic. [Plot: % FP on the horizontal axis, % TP on the vertical axis; series: ROC, Ideal ROC, Random ROC.]

Lift charts were developed to demonstrate this in a graphical way (Rud, 2000). The focus is again on the TPs and, thus, it can be argued that they indicate the sensitivity of the model unlike ROC curves, which can show the relation between sensitivity and specificity.

The motivation for building lift charts was to depict how much better the classifier performs compared to randomly selecting x% of the data (for prospects to call), which would yield x% of the targets. Lift is the improvement over this random selection that a predictive model can potentially yield because of its scoring or ranking ability. For example, in the data from Table 8.3, there are a total of 6 TPs out of 20 test cases. If one were to take the unscored data and randomly select 25% of the examples, the selection would be expected to contain 25% of the TPs (or 25% of 6 = 1.5). However, scoring and reordering the dataset by confidence will improve this. As can be seen in Table 8.4, the first 25% or quartile of scored (reordered) data now contains four TPs. This translates to a lift of 4/1.5 = 2.67. Similarly, the first two quartiles of the unscored data can be expected to contain 50% (or three) of the TPs. As seen in Table 8.4, the scored 50% data contains all six TPs, giving a lift of 6/3 = 2.00.

The steps to build lift charts are:

1. Generate scores for all the data points (prospects) in the test set using the trained model.

2. Rank the prospects by decreasing score or confidence of response.

3. Count the TPs in the first 25% (quartile) of the dataset, and then the first 50% (add the next quartile) and so on; see columns Cumulative TP and Quartile in Table 8.4.

4. Gain at a given quartile level is the ratio of the cumulative number of TPs in that quartile to the total number of TPs in the entire dataset (six in the example). The 1st quartile gain is, therefore, 4/6 or 67%, the 2nd quartile gain is 6/6 or 100%, and so on.

5. Lift is the ratio of gain to the random expectation at a given quartile level. Remember that the random expectation for the first x% of the data is x% of the TPs. In the example, the random expectation is to find 25% of 6 = 1.5 TPs in the 1st quartile, 50% or 3 TPs in the first two quartiles, and so on. The corresponding 1st quartile lift is, therefore, 4/1.5 = 2.667, the 2nd quartile lift is 6/3 = 2.00, and so on.

The corresponding curves for the simple example are shown in Fig. 8.2. Typically lift charts are created on deciles not quartiles. Quartiles were chosen here because they helped to illustrate the concept using the small 20-sample test dataset. However, the logic remains the same for deciles or any other groupings as well.
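The steps listed above can be sketched in Python as follows. The function name `gain_and_lift` and its parameters are illustrative and not taken from any tool. The input is the list of actual classes from Table 8.4, already ranked by decreasing confidence, and the number of bins is set to four to match the quartile example.

```python
def gain_and_lift(actual_ranked, n_bins):
    """Cumulative TP count, gain, and lift after each of n_bins equal-sized bins.

    actual_ranked: 1 for an actual responder, 0 otherwise,
                   sorted by decreasing model confidence.
    """
    n = len(actual_ranked)
    total_tp = sum(actual_ranked)
    results = []
    for k in range(1, n_bins + 1):
        cutoff = n * k // n_bins              # number of examples in the first k bins
        cum_tp = sum(actual_ranked[:cutoff])  # responders captured so far
        gain = cum_tp / total_tp              # share of all responders captured
        random_share = k / n_bins             # share a random selection would capture
        lift = gain / random_share
        results.append((k, cum_tp, gain, lift))
    return results


# Actual classes from Table 8.4 (1 = Response, 0 = No response).
actual_ranked = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0] + [0] * 10

quartiles = gain_and_lift(actual_ranked, n_bins=4)
for k, cum_tp, gain, lift in quartiles:
    print(f"quartile {k}: cumulative TP = {cum_tp}, gain = {gain:.0%}, lift = {lift:.3f}")
```

Passing `n_bins=10` to the same function would give decile-level values for a larger dataset.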

Table 8.4 Scoring Predictions and Sorting by Confidences Is the Basis for Generating Lift Curves

| Actual Class | Predicted Class | Confidence of “Response” | Type? | Cumulative TP | Cumulative FP | Quartile | Gain | Lift     |
|--------------|-----------------|--------------------------|-------|---------------|---------------|----------|------|----------|
| Response     | Response        | 0.902                    | TP    | 1             | 0             | 1st      | 67%  | 2.666667 |
| Response     | Response        | 0.896                    | TP    | 2             | 0             | 1st      |      |          |
| Response     | Response        | 0.834                    | TP    | 3             | 0             | 1st      |      |          |
| Response     | Response        | 0.741                    | TP    | 4             | 0             | 1st      |      |          |
| No response  | Response        | 0.686                    | FP    | 4             | 1             | 1st      |      |          |
| Response     | Response        | 0.616                    | TP    | 5             | 1             | 2nd      | 100% | 2        |
| Response     | Response        | 0.609                    | TP    | 6             | 1             | 2nd      |      |          |
| No response  | Response        | 0.576                    | FP    | 6             | 2             | 2nd      |      |          |
| No response  | Response        | 0.542                    | FP    | 6             | 3             | 2nd      |      |          |
| No response  | Response        | 0.530                    | FP    | 6             | 4             | 2nd      |      |          |
| No response  | No response     | 0.440                    | TN    | 6             | 4             | 3rd      | 100% | 1.333333 |
| No response  | No response     | 0.428                    | TN    | 6             | 4             | 3rd      |      |          |
| No response  | No response     | 0.393                    | TN    | 6             | 4             | 3rd      |      |          |
| No response  | No response     | 0.313                    | TN    | 6             | 4             | 3rd      |      |          |
| No response  | No response     | 0.298                    | TN    | 6             | 4             | 3rd      |      |          |
| No response  | No response     | 0.260                    | TN    | 6             | 4             | 4th      | 100% | 1        |
| No response  | No response     | 0.248                    | TN    | 6             | 4             | 4th      |      |          |
| No response  | No response     | 0.247                    | TN    | 6             | 4             | 4th      |      |          |
| No response  | No response     | 0.241                    | TN    | 6             | 4             | 4th      |      |          |
| No response  | No response     | 0.116                    | TN    | 6             | 4             | 4th      |      |          |

TP, true positive; FP, false positive; TN, true negative.

FIGURE 8.2 Lift and gain curves. [Plot: quartile (1 to 4) on the horizontal axis; gain as a percentage and lift as a ratio on the two vertical axes.]

8.4 HOW TO IMPLEMENT

A built-in dataset in RapidMiner will be used to demonstrate how all three classification performance tools (confusion matrix, ROC/AUC, and lift/gain charts) are evaluated. The process shown in Fig. 8.3 uses the Generate Direct Mailing Data operator to create a 10,000 record dataset. The objective of the modeling (Naïve Bayes used here) is to predict whether a person is likely to respond to a direct mailing campaign or not based on demographic attributes (age, lifestyle, earnings, type of car, family status, and sports affinity).

FIGURE 8.3 Process setup to demonstrate typical classification performance metrics.

Step 1: Data Preparation

Create a dataset with 10,000 examples using the Generate Direct Mailing Data operator by setting a local random seed (default = 1992) to ensure repeatability. Convert the label attribute from polynominal (nominal) to binominal using the appropriate operator as shown. This enables one to select specific binominal classification performance measures.

Split data into two partitions: an 80% partition (8000 examples) for model building and validation and a 20% partition for testing. An important point to note is that data partitioning is not an exact science and this ratio can change depending on the data.

Connect the 80% output (upper output port) from the Split Data operator to the Split Validation operator. Select a relative split with a ratio of 0.7 (70% for training) and shuffled sampling.

Step 2: Modeling Operator and Parameters

Insert the Naïve Bayes operator in the Training panel of the Split Validation operator and the usual Apply Model operator in the Testing panel. Add a Performance (Binomial Classification) operator. Select the following options in the performance operator: accuracy, FP, FN, TP, TN, sensitivity, specificity, and AUC.

Step 3: Evaluation

Add another Apply Model operator outside the Split Validation operator and deliver the model to its mod input port while connecting the 2000-example data partition from Step 1 to the unl port. Add a Create Lift Chart operator with these options selected: target class = response, binning type = frequency, and number of bins = 10. Note the port connections as shown in Fig. 8.3.

Step 4: Execution and Interpretation

When the above process is run, the confusion matrix and ROC curve for the validation sample should be generated (30% of the original 80% = 2400 examples), whereas a lift curve should be generated for the test sample (2000 examples). There is no reason why one cannot add another Performance (Binomial Classification) operator for the test sample or create a lift chart for the validation examples. (The reader should try this as an exercise: how will the output from the Create Lift Chart operator be delivered when it is inserted inside the Split Validation operator?)

The confusion matrix shown in Fig. 8.4 is used to calculate several common metrics using the definitions from Table 8.2. Compare them with the RapidMiner outputs to verify understanding.

TP = 629; TN = 1231; FP = 394; FN = 146

| Term        | Definition                    | Calculation                                |
|-------------|-------------------------------|--------------------------------------------|
| Sensitivity | TP/(TP + FN)                  | 629/(629 + 146) = 81.16%                   |
| Specificity | TN/(TN + FP)                  | 1231/(1231 + 394) = 75.75%                 |
| Precision   | TP/(TP + FP)                  | 629/(629 + 394) = 61.5%                    |
| Recall      | TP/(TP + FN)                  | 629/(629 + 146) = 81.16%                   |
| Accuracy    | (TP + TN)/(TP + TN + FP + FN) | (629 + 1231)/(629 + 1231 + 394 + 146) = 77.5% |

Note that RapidMiner makes a distinction between the two classes while calculating precision and recall. For example, in order to calculate a class recall for “no response,” the positive class becomes “no response” and the corresponding TP is 1231 and the corresponding FN is 394. Therefore, a class recall for “no response” is 1231/(1231 + 394) = 75.75%, whereas the calculation above assumed that “response” was the positive class. Class recall is an important metric to keep in mind when dealing with highly unbalanced data. Data are considered unbalanced if the proportion of the two classes is skewed. When models are trained on unbalanced data, the resulting class recalls also tend to be skewed. For example, in a dataset where there are only 2% responses, the resulting model can have a high recall for “no responses” but a very low class recall for “responses.” This skew is not seen in the overall model accuracy and using this model on unseen data may result in severe misclassifications.
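The per-class view can be illustrated with a small Python sketch using the four counts from the validation set. The helper name `class_recall_precision` is made up for this illustration; the idea is simply that switching the positive class swaps which counts play the roles of hits, misses, and false alarms.

```python
# Counts from the validation set, with "response" as the positive class.
tp, tn, fp, fn = 629, 1231, 394, 146


def class_recall_precision(hits, misses, false_alarms):
    """Recall and precision for whichever class is treated as positive."""
    recall = hits / (hits + misses)
    precision = hits / (hits + false_alarms)
    return recall, precision


# Positive class = "response": hits are TP, misses are FN, false alarms are FP.
response_recall, response_precision = class_recall_precision(tp, fn, fp)

# Positive class = "no response": hits are TN, misses are the actual
# "no response" cases predicted as "response" (FP above), and false alarms
# are the actual "response" cases predicted as "no response" (FN above).
no_response_recall, no_response_precision = class_recall_precision(tn, fp, fn)

overall_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class recall (response)       = {response_recall:.2%}")     # 81.16%
print(f"class recall (no response)    = {no_response_recall:.2%}")  # 75.75%
print(f"class precision (response)    = {response_precision:.2%}")
print(f"class precision (no response) = {no_response_precision:.2%}")
print(f"accuracy                      = {overall_accuracy:.2%}")    # 77.50%
```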

The solution to this problem is to either balance the training data so that one ends up with a more or less equal proportion of classes or to insert penalties or costs on misclassifications using the Metacost operator as discussed in Chapter 5, Regression Methods. Data balancing is explained in more detail in Chapter 13, Anomaly Detection.

FIGURE 8.4 Confusion matrix for validation set of direct marketing dataset.

FIGURE 8.5 ROC curve and AUC. ROC, receiver operator characteristic; AUC, area under the curve.

The AUC is shown along with the ROC curve in Fig. 8.5. As mentioned earlier, AUC values close to 1 are indicative of a good model. The ROC captures the sorted confidences of a prediction. As long as the prediction is correct for the examples, the curve takes one step up (increased TP). If the prediction is wrong, the curve takes one step to the right (increased FP). RapidMiner can show two additional AUCs called optimistic and pessimistic. The differences between the optimistic and pessimistic curves occur when there are examples with the same confidence, but the predictions are sometimes false and sometimes true. The optimistic curve shows the possibility that the correct predictions are chosen first so the curve goes steeper upwards. The pessimistic curve shows the possibility that the wrong predictions are chosen first so the curve increases more gradually.
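One way to picture the optimistic and pessimistic variants is to break ties in confidence in the two possible directions and compare the resulting areas. The sketch below is a simplified illustration, not RapidMiner's implementation: the six scored examples are hypothetical, an actual positive is treated as the "correct" case within a tie, and the function names are chosen here.

```python
def step_auc(labels_in_rank_order):
    """Area under the ROC step curve: one step up per positive, one step right per negative."""
    n_pos = sum(labels_in_rank_order)
    n_neg = len(labels_in_rank_order) - n_pos
    tp_so_far = 0
    area = 0.0
    for label in labels_in_rank_order:
        if label == 1:
            tp_so_far += 1
        else:
            # Each step to the right adds a strip of width 1/n_neg
            # at the current TP-rate height.
            area += (tp_so_far / n_pos) * (1 / n_neg)
    return area


def rank(scored, correct_first):
    """Sort by decreasing confidence; within ties, put positives first or last."""
    if correct_first:
        key = lambda s: (-s[0], -s[1])
    else:
        key = lambda s: (-s[0], s[1])
    return [label for _, label in sorted(scored, key=key)]


# Hypothetical (confidence, actual label) pairs with ties at 0.8 and 0.6.
scored = [(0.9, 1), (0.8, 0), (0.8, 1), (0.6, 1), (0.6, 0), (0.3, 0)]

optimistic = step_auc(rank(scored, correct_first=True))
pessimistic = step_auc(rank(scored, correct_first=False))
print(f"optimistic AUC  = {optimistic:.3f}")
print(f"pessimistic AUC = {pessimistic:.3f}")
```

With no tied confidences, both orderings are identical and the two areas coincide.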

Finally, the lift chart outputs do not directly indicate the lift values as has been demonstrated with the simple example earlier. In Step 3 of the process, 10 bins were selected for the chart and, thus, each bin will have 200 examples (a decile). Recall that to create a lift chart all the predictions will need to be sorted by the confidence of the positive class (response), which is shown in Fig. 8.6.

The first bar in the lift chart shown in Fig. 8.7 corresponds to the first bin of 200 examples after the sorting. The bar reveals that there are 181 TPs in this bin (the table in Fig. 8.6 shows, for instance, that the second example, Row No. 1973, is an FP). From the confusion matrix earlier, 629 TPs can be seen in this example set. A random classifier would have identified 10% of these or 62.9 TPs in the first 200 examples. Therefore, the lift for the first decile is 181/62.9 = 2.88. Similarly the lift for the first two deciles is (181 + 167)/(2 × 62.9) = 2.77 and so on. Also, the first decile contains 181/629 = 28.8% of the TPs, the first two deciles contain (181 + 167)/629 = 55.3% of the TPs, and so on. This is shown in the cumulative (percent) gains curve on the right-hand y-axis of the lift chart output.
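The decile arithmetic above can be checked with a few lines of Python that start from the per-bin TP counts rather than from individual scored examples. Only the first two bins (181 and 167 TPs) are quoted in the text, so only those are used; the variable names are illustrative.

```python
from itertools import accumulate

total_tp = 629               # TPs quoted from the confusion matrix
n_bins = 10                  # deciles
tp_per_decile = [181, 167]   # TP counts of the first two bins, as quoted

results = []
for k, cum_tp in enumerate(accumulate(tp_per_decile), start=1):
    expected_random = total_tp * k / n_bins   # TPs a random ordering would capture
    lift = cum_tp / expected_random
    gain = cum_tp / total_tp                  # cumulative share of TPs captured
    results.append((k, lift, gain))
    print(f"first {k} decile(s): lift = {lift:.3f}, gain = {gain:.1%}")
```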

As described earlier, a good classifier will accumulate all the TPs in the first few deciles and will have extremely few FPs at the top of the heap. This will result in a gain curve that quickly rises to the 100% level within the first few deciles.

FIGURE 8.6 Table of scored responses used to build the lift chart.

FIGURE 8.7 Lift chart generated.

8.5 CONCLUSION

This chapter covered the basic performance evaluation tools that are typically used in classification methods. First, the basic elements of a confusion matrix were described and then the concepts that are important to understanding it, such as sensitivity, specificity, and accuracy, were explored in detail. The ROC curve was then described, which has its origins in signal detection theory and has now been adopted for data science, along with the equally useful aggregate metric of AUC. Finally, two useful tools were described that have their origins in direct marketing applications: lift and gain charts. How to build these curves in general and how they can be constructed using RapidMiner was discussed. In summary, these tools are some of the most commonly used metrics for evaluating predictive models, and developing skill and confidence in using them is a prerequisite to developing data science expertise.

One key to developing good predictive models is to know when to use which measures. As discussed earlier, relying on a single measure like accuracy can be misleading. For highly unbalanced datasets, rely on several measures such as class recall and precision in addition to accuracy. ROC curves are frequently used to compare several algorithms side by side. Additionally, just as there are an infinite number of triangular shapes that have the same area, AUC should not be used alone to judge a model; AUC and ROCs should be used in conjunction to rate a model’s performance. Finally, lift and gain charts are most commonly used for scoring applications where the examples in a dataset need to be rank-ordered according to their propensity to belong to a particular category.

References

Berry, M. A. (1999). Mastering data mining: The art and science of customer relationship management. New York: John Wiley and Sons.

Black, K. (2008). Business statistics for contemporary decision making. New York: John Wiley and Sons.

Green, D. S. (1966). Signal detection theory and psychophysics. New York: John Wiley and Sons.

Kohavi, R., & Provost, F. (1998). Glossary of terms. Machine Learning, 30, 271–274.

Rud, O. (2000). Data mining cookbook: Modeling data for marketing, risk and customer relationship management. New York: John Wiley and Sons.

Taylor, J. (2011). Decision management systems: A practical guide to using business rules and predictive analytics. Boston, Massachusetts: IBM Press.

Cover
Data Science: Concepts and Practice
Vijay Kotu
Bala Deshpande, PhD
