CHAPTER 8
Model Evaluation
In this chapter, the most commonly used methods for testing the quality of a data science model will be formally introduced. Throughout this book, various validation techniques have been used to split the available data into a training set and a testing set. In the implementation sections, different types of performance operators have been used in conjunction with validation without a detailed explanation of how these operators really function. Several ways in which predictive data science models are evaluated for their performance will now be discussed.
There are a few main tools that are available to test a classification model’s quality: confusion matrices (or truth tables), lift charts, ROC (receiver operator characteristic) curves, and area under the curve (AUC). How these tools are constructed will be defined in detail and how to implement performance evaluations will be described. To evaluate a numeric prediction from a regression model, there are many conventional statistical tests that may be applied (Black, 2008), a few of which were discussed in Chapter 5, Regression Methods.
DIRECT MARKETING
Direct marketing (DM) companies, which send out postal mail (or, in the days before do-not-call lists, called prospects), were among the early pioneers in applying data science techniques (Berry, 1999). A key performance indicator for their marketing activities is, of course, the improvement in their bottom line as a result of their use of predictive models.
Assume that a typical average response rate for a direct mail campaign is 10%. Further assume that the cost per mail sent = $1 and the potential revenue per response = $20.
If they have 10,000 people to send out their mailers to, then they can expect to receive potential revenues of 10,000 × 10% × $20 = $20,000, which would yield a net return of $10,000. Typically, the mailers are sent out in batches to spread costs over a period of time. Further assume that these are sent out in batches of 1000. The first question someone would ask is how to divide the list of names into these batches. If the average expectation of return is 10%, then would it not make a lot of sense to just send one batch of mails to those prospects that make up this 10% and be done with the campaign?
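The baseline arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `campaign_net_return` and its parameter names are made up here, and the defaults simply restate the assumed $1 cost per mailer and $20 revenue per response.

```python
def campaign_net_return(n_mailed, response_rate,
                        revenue_per_response=20.0, cost_per_mail=1.0):
    """Return (revenue, cost, net) for a mailing campaign.

    Hypothetical helper: the defaults restate the assumptions in the text.
    """
    revenue = n_mailed * response_rate * revenue_per_response
    cost = n_mailed * cost_per_mail
    return revenue, cost, revenue - cost


# Mail all 10,000 prospects at the assumed 10% average response rate.
revenue, cost, net = campaign_net_return(10_000, 0.10)
print(f"revenue=${revenue:,.0f}  cost=${cost:,.0f}  net=${net:,.0f}")
```

Changing `n_mailed` and `response_rate` shows how the net return moves when fewer, better-chosen prospects are mailed.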
(Continued )
Data Science. DOI: https://doi.org/10.1016/B978-0-12-814761-0.00008-3 © 2019 Elsevier Inc. All rights reserved.
8.1 CONFUSION MATRIX
Classification performance is best described by an aptly named tool called the confusion matrix or truth table. Understanding the confusion matrix requires becoming familiar with several definitions. But before introducing the definitions, a basic confusion matrix for a binary or binomial classification must first be looked at, where there can be two classes (say, Y or N). The accuracy of classification of a specific example can be viewed in one of four possible ways:
• The predicted class is Y, and the actual class is also Y - this is a True Positive or TP
• The predicted class is Y, and the actual class is N - this is a False Positive or FP
• The predicted class is N, and the actual class is Y - this is a False Negative or FN
• The predicted class is N, and the actual class is also N - this is a True Negative or TN
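The four outcomes listed above can be tallied mechanically once the actual and predicted classes are available side by side. The sketch below is a minimal illustration, not a library routine: the function name `tally_outcomes` is hypothetical, and the six labels are made up purely to exercise each case.

```python
from collections import Counter


def tally_outcomes(actual, predicted, positive="Y"):
    """Count TP, FP, FN, and TN for a binary classification."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted Y: correct if the actual class is also Y.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted N: a miss if the actual class is Y.
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


# Hypothetical labels, invented only for illustration.
actual = ["Y", "Y", "N", "N", "Y", "N"]
predicted = ["Y", "N", "Y", "N", "Y", "N"]

outcomes = tally_outcomes(actual, predicted)
print(outcomes)
```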
A basic confusion matrix is traditionally arranged as a 2 × 2 matrix as shown in Table 8.1. The predicted classes are arranged horizontally in rows and the actual classes are arranged vertically in columns, although sometimes this order is reversed (Kohavi & Provost, 1998). A quick way to examine this matrix, or truth table as it is also called, is to scan the diagonal from top left to bottom right. An ideal classification performance would only have entries along this main diagonal and the off-diagonal elements would be zero.

Table 8.1 Confusion Matrix

                                  Actual Class (Observation)
                                  Y                      N
Predicted class (expectation)  Y  TP correct result      FP unexpected result
                               N  FN missing result      TN correct absence of result

TP, true positive; FP, false positive; FN, false negative; TN, true negative.

DIRECT MARKETING (Continued)
Clearly this would save a lot of time and money and the net return would jump to $19,000!
Can all of these 10 percenters be identified? While this is clearly unrealistic, classification techniques can be used to rank or score prospects by the likelihood that they would respond to the mailers. Predictive analytics is after all about converting future uncertainties into usable probabilities (Taylor, 2011). Then a predictive method can be used to order these probabilities and send out the mailers to only those who score above a particular threshold (say 85% chance of response).
Finally, some techniques may be better suited to this problem than others. How can the different available methods be compared based on their performance? Will logistic regression capture these top 10 percenters better than support vector machines? What are the different metrics that can be used to select the best performing methods? These are some of the things that will be discussed in this chapter in more detail.
These four cases will now be used to introduce several commonly used terms for understanding and explaining classification performance. As mentioned earlier, a perfect classifier will have no entries for FP and FN (i.e., the number of FP = number of FN = 0).
1. Sensitivity is the ability of a classifier to select all the cases that need to be selected. A perfect classifier will select all the actual Y’s and will not miss any actual Y’s. In other words it will have no FNs. In reality, any classifier will miss some true Y’s, and thus, have some FNs. Sensitivity is expressed as a ratio (or percentage) calculated as follows: TP/(TP + FN). However, sensitivity alone is not sufficient to evaluate a classifier. In situations such as credit card fraud, where fraud rates are typically around 0.1%, an ordinary classifier may be able to show sensitivity of 99.9% by picking nearly all the cases as legitimate transactions or TP. The ability to detect illegitimate or fraudulent transactions, the TNs, is also needed. This is where the next measure, specificity, which ignores TPs, comes in.
2. Specificity is the ability of a classifier to reject all the cases that need to be rejected. A perfect classifier will reject all the actual N’s and will not deliver any unexpected results. In other words, it will have no FPs. In reality, any classifier will select some cases that need to be rejected, and thus, have some FPs. Specificity is expressed as a ratio (or percentage) calculated as: TN/(TN + FP).
3. Relevance is a term that is easy to understand in a document search and retrieval scenario. Suppose a search is run for a specific term and that search returns 100 documents. Of these, let us say only 70 were useful because they were relevant to the search. Furthermore, the search actually missed out on an additional 40 documents that could actually have been useful. With this context, additional terms can be defined.
4. Precision is defined as the proportion of cases found that were actually relevant. From the example, this number was 70, and thus, the precision is 70/100 or 70%. The 70 documents were TP, whereas the remaining 30 were FP. Therefore, precision is TP/(TP + FP).
5. Recall is defined as the proportion of the relevant cases that were actually found among all the relevant cases. Again, with the example, only 70 of the total 110 (70 found + 40 missed) relevant cases were actually found, thus giving a recall of 70/110 = 63.64%. It is evident that recall is the same as sensitivity, because recall is also given by TP/(TP + FN).
6. Accuracy is defined as the ability of the classifier to select all cases that need to be selected and reject all cases that need to be rejected. For a classifier with 100% accuracy, this would imply that FN = FP = 0. Note that in the document search example, the TN has not been indicated, as this could be really large. Accuracy is given by (TP + TN)/(TP + FP + TN + FN). Finally, error is simply the complement of accuracy, measured by (1 − accuracy).
Table 8.2 summarizes all the major definitions. Fortunately, the analyst does not need to memorize these equations because their calculations are always automated in any tool of choice. However, it is important to have a good fundamental understanding of these terms.
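As a minimal sketch of how these definitions translate into code, the functions below compute each measure directly from the four counts. The function names are illustrative choices, not taken from any particular tool. The document search example supplies TP = 70, FP = 30, and FN = 40; since TN is not given there, only precision and recall are evaluated.

```python
def sensitivity(tp, fn):
    """Ability to select what needs to be selected: TP/(TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Ability to reject what needs to be rejected: TN/(TN + FP)."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Proportion of cases found that were relevant: TP/(TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Proportion of all relevant cases that were found (same as sensitivity)."""
    return sensitivity(tp, fn)


def accuracy(tp, tn, fp, fn):
    """Aggregate measure: (TP + TN)/(TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Document search example: 100 documents returned, 70 relevant, 40 missed.
tp, fp, fn = 70, 30, 40
print(f"precision = {precision(tp, fp):.2%}")
print(f"recall    = {recall(tp, fn):.2%}")
```

Error follows as `1 - accuracy(tp, tn, fp, fn)` whenever all four counts are known.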
Table 8.2 Evaluation Measures

Term         Definition                                        Calculation
Sensitivity  Ability to select what needs to be selected       TP/(TP + FN)
Specificity  Ability to reject what needs to be rejected       TN/(TN + FP)
Precision    Proportion of cases found that were relevant      TP/(TP + FP)
Recall       Proportion of all relevant cases that were found  TP/(TP + FN)
Accuracy     Aggregate measure of classifier performance       (TP + TN)/(TP + TN + FP + FN)

TP, true positive; FP, false positive; FN, false negative; TN, true negative.

8.2 ROC AND AUC
Measures like accuracy or precision are essentially aggregates by nature, in the sense that they provide the average performance of the classifier on the dataset. A classifier can have a high accuracy on a dataset but have poor class recall and precision. Clearly, a model to detect fraud is no good if its ability to detect TP for the fraud = yes class (and thereby its class recall) is low. It is, therefore, quite useful to look at measures that compare different metrics to see if there is a situation for a trade-off: for example, can a little overall accuracy be sacrificed to gain a lot more improvement in class recall? One can examine a model’s rate of detecting TPs and contrast it with its rate of producing FPs. The receiver operator characteristic (ROC) curves meet this need and were originally developed in the field of signal detection (Green, 1966). A ROC curve is created by plotting the fraction of TPs (TP rate) versus the fraction of FPs (FP rate). When a table of such values is generated, the FP rate can be plotted on the horizontal axis and the TP rate (same as sensitivity or recall) on the vertical axis. The FP rate can also be expressed as (1 − specificity), that is, 1 − TN rate.
Consider a classifier that could predict if a website visitor is likely to click on a banner ad: the model would be most likely built using historic click-through rates based on pages visited, time spent on certain pages, and other characteristics of site visitors. In order to evaluate the performance of this model on test data, a table such as the one shown in Table 8.3 can be generated.
The first column “Actual Class” consists of the actual class for a particular example (in this case a website visitor, who has clicked on the banner ad). The next column, “Predicted Class” is the model prediction and the third column, “Confidence of response” is the confidence of this prediction. In order to create a ROC chart, the predicted data will need to be sorted in decreasing order of confidence level, which has been done in this case. By comparing columns Actual class and Predicted class, the type of prediction can be identified: for instance, spreadsheet rows 2 through 5 are all TPs and row 6 is the first instance of a FP. As observed in columns “Number of TP” and “Number of FP,” one can keep a running count of the TPs and FPs and also calculate the fraction of TPs and FPs, which are shown in columns “Fraction of TP” and “Fraction of FP.”
Observing the “Number of TP” and “Number of FP” columns, it is evident that the model has discovered a total of 6 TPs and 4 FPs (the remaining 10 examples are all TNs). It can also be seen that the model has identified nearly 67% of all the TPs before it fails and hits its first FP (row 6 above). Finally, all TPs have been identified (when Fraction of TP = 1) before the next FP was run into. If Fraction of FP (FP rate) versus Fraction of TP (TP rate) were now to be plotted, then a ROC chart similar to the one shown in Fig. 8.1 would be seen. Clearly an ideal classifier would have an accuracy of 100% (and thus, would have identified 100% of all TPs). Thus, the ROC for an ideal classifier would look like the dashed line shown in Fig. 8.1. Finally, an ordinary or random classifier (which has only a 50% accuracy) would possibly be able to find one FP for every TP, and thus, look like the 45-degree line shown.
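The running counts and fractions described above can be reproduced with a short loop. The sketch below encodes the actual and predicted classes from Table 8.3 as "R" (response) and "N" (no response), already sorted by decreasing confidence; those short codes and the variable names are illustrative choices. The fractions follow the table's own convention of dividing by the total number of TPs (6) and FPs (4) that the model produced.

```python
# Actual classes from Table 8.3, sorted by decreasing confidence of "Response".
actual = ["R", "R", "R", "R", "N", "R", "R", "N", "N", "N"] + ["N"] * 10
# The model predicts "Response" for the first 10 rows, "No response" for the rest.
predicted = ["R"] * 10 + ["N"] * 10

total_tp = sum(a == "R" and p == "R" for a, p in zip(actual, predicted))
total_fp = sum(a == "N" and p == "R" for a, p in zip(actual, predicted))

rows = []
n_tp = n_fp = 0
for a, p in zip(actual, predicted):
    if p == "R" and a == "R":
        n_tp += 1          # true positive: step up on the ROC chart
    elif p == "R" and a == "N":
        n_fp += 1          # false positive: step to the right
    rows.append((n_tp, n_fp, n_fp / total_fp, n_tp / total_tp))

print("Number of TP  Number of FP  Fraction of FP  Fraction of TP")
for tp_count, fp_count, fp_frac, tp_frac in rows:
    print(f"{tp_count:>12}  {fp_count:>12}  {fp_frac:>14.3f}  {tp_frac:>14.3f}")
```

Plotting the third column against the fourth gives the staircase ROC described in the text.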
As the number of test examples becomes larger, the ROC curve will become smoother: the random classifier will simply look like a straight line drawn between the points (0,0) and (1,1), as the stair steps become extremely small. The area under this random classifier’s ROC curve is basically the area of a right triangle (with base 1 and height 1), which is 0.5. This quantity is termed Area Under the Curve or AUC. AUC for the ideal classifier is 1.0. Thus, the performance of a classifier can also be quantified by its AUC: obviously any AUC higher than 0.5 is better than random and the closer it is to 1.0, the better the performance. A common rule of thumb is to select those classifiers that not only have a ROC curve that is closest to ideal, but also an AUC higher than 0.8. Typical uses for AUC and ROC curves are to compare the performance of different classification algorithms for the same dataset.
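One simple way to compute an AUC from a list of ROC points is the trapezoidal rule, sketched below. The function name `auc_trapezoid` is a hypothetical helper. The random and ideal curves are the ones described in the text; the third curve uses the corner points of the staircase implied by the "Fraction of FP" and "Fraction of TP" columns of Table 8.3, taken as they appear there.

```python
def auc_trapezoid(points):
    """Area under a curve given (x, y) points ordered by non-decreasing x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Random classifier: straight line from (0, 0) to (1, 1).
random_roc = [(0.0, 0.0), (1.0, 1.0)]
# Ideal classifier: all TPs are found before any FP.
ideal_roc = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Corner points (Fraction of FP, Fraction of TP) read from Table 8.3.
table_roc = [(0.0, 0.0), (0.0, 4 / 6), (0.25, 4 / 6), (0.25, 1.0), (1.0, 1.0)]

for name, curve in [("random", random_roc), ("ideal", ideal_roc), ("Table 8.3", table_roc)]:
    print(f"AUC ({name}) = {auc_trapezoid(curve):.3f}")
```

With the table's fractions, the example classifier lands between the two reference curves, much closer to the ideal one.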
Table 8.3 Classifier Performance Data Needed for Building a ROC Curve

Actual Class  Predicted Class  Confidence of “Response”  Type?  Number of TP  Number of FP  Fraction of FP  Fraction of TP
Response      Response         0.902                     TP     1             0             0               0.167
Response      Response         0.896                     TP     2             0             0               0.333
Response      Response         0.834                     TP     3             0             0               0.500
Response      Response         0.741                     TP     4             0             0               0.667
No response   Response         0.686                     FP     4             1             0.25            0.667
Response      Response         0.616                     TP     5             1             0.25            0.833
Response      Response         0.609                     TP     6             1             0.25            1
No response   Response         0.576                     FP     6             2             0.5             1
No response   Response         0.542                     FP     6             3             0.75            1
No response   Response         0.530                     FP     6             4             1               1
No response   No response      0.440                     TN     6             4             1               1
No response   No response      0.428                     TN     6             4             1               1
No response   No response      0.393                     TN     6             4             1               1
No response   No response      0.313                     TN     6             4             1               1
No response   No response      0.298                     TN     6             4             1               1
No response   No response      0.260                     TN     6             4             1               1
No response   No response      0.248                     TN     6             4             1               1
No response   No response      0.247                     TN     6             4             1               1
No response   No response      0.241                     TN     6             4             1               1
No response   No response      0.116                     TN     6             4             1               1

ROC, receiver operator characteristic; TP, true positive; FP, false positive; TN, true negative.

8.3 LIFT CURVES
Lift curves or lift charts were first deployed in direct marketing where the problem was to identify if a particular prospect was worth calling or sending an advertisement by mail. It was mentioned in the use case at the beginning of this chapter that with a predictive model, one can score a list of prospects by their propensity to respond to an ad campaign. When the prospects are sorted by this score (by the decreasing order of their propensity to respond), one now ends up with a mechanism to systematically select the most valuable prospects right at the beginning, and thus, maximize their return. Thus, rather than mailing out the ads to a random group of prospects, the ads can now be sent to the first batch of “most likely responders,” followed by the next batch and so on.
Without classification, the “most likely responders” are distributed randomly throughout the dataset. Suppose there is a dataset of 200 prospects and it contains a total of 40 responders or TPs. If the dataset is broken up into, say, 10 equal sized batches (called deciles), the likelihood of finding TPs in each batch is also 20%, that is, four samples in each decile will be TPs. However, when a predictive model is used to classify the prospects, a good model will tend to pull these “most likely responders” into the top few deciles. Thus, in this simple example, it might be found that the first two deciles will have all 40 TPs and the remaining eight deciles have none.
[Figure: % TP (vertical axis) plotted against % FP (horizontal axis), both from 0 to 1.2; series: ROC, Ideal ROC, Random ROC]
FIGURE 8.1 Comparing ROC curve for the example shown in Table 8.3 to random and ideal classifiers. ROC, receiver operator characteristic.
Lift charts were developed to demonstrate this in a graphical way (Rud, 2000). The focus is again on the TPs and, thus, it can be argued that they indicate the sensitivity of the model unlike ROC curves, which can show the relation between sensitivity and specificity.
The motivation for building lift charts was to depict how much better the classifier performs compared to randomly selecting x% of the data (for prospects to call), which would yield x% of the targets. Lift is the improvement over this random selection that a predictive model can potentially yield because of its scoring or ranking ability. For example, in the data from Table 8.3, there are a total of 6 TPs out of 20 test cases. If one were to take the unscored data and randomly select 25% of the examples, it would be expected that they contain 25% of the TPs (25% of 6 = 1.5). However, scoring and reordering the dataset by confidence will improve this. As can be seen in Table 8.4, the first 25% or quartile of scored (reordered) data now contains four TPs. This translates to a lift of 4/1.5 = 2.67. Similarly, the first two quartiles of the unscored data can be expected to contain 50% (or three) of the TPs. As seen in Table 8.4, the scored 50% data contains all six TPs, giving a lift of 6/3 = 2.00.
The steps to build lift charts are:
1. Generate scores for all the data points (prospects) in the test set using the trained model.
2. Rank the prospects by decreasing score or confidence of response.
3. Count the TPs in the first 25% (quartile) of the dataset, and then the first 50% (add the next quartile) and so on; see columns Cumulative TP and Quartile in Table 8.4.
4. Gain at a given quartile level is the ratio of the cumulative number of TPs up to that quartile to the total number of TPs in the entire dataset (six in the example). The 1st quartile gain is, therefore, 4/6 or 67%, the 2nd quartile gain is 6/6 or 100%, and so on.
5. Lift is the ratio of gain to the random expectation at a given quartile level. Remember that the random expectation when x% of the data is selected is x% of the TPs. In the example, the random expectation is to find 25% of 6 = 1.5 TPs in the 1st quartile, 50% or 3 TPs in the first two quartiles, and so on. The corresponding 1st quartile lift is, therefore, 4/1.5 = 2.667, the 2nd quartile lift is 6/3 = 2.00, and so on.
The corresponding curves for the simple example are shown in Fig. 8.2. Typically lift charts are created on deciles not quartiles. Quartiles were chosen here because they helped to illustrate the concept using the small 20-sample test dataset. However, the logic remains the same for deciles or any other groupings as well.
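The gain and lift steps can be sketched in a few lines of Python. The function name `gain_and_lift` is hypothetical, and the sketch assumes the number of examples divides evenly into the requested number of bins. The input is the TP indicator column implied by Table 8.4 (1 for a TP, 0 otherwise), already ranked by decreasing confidence.

```python
def gain_and_lift(hits, n_bins):
    """Cumulative gain and lift per bin for hits sorted by decreasing score.

    hits: list of 1 (true positive) or 0, already ranked by confidence.
    Assumes len(hits) is divisible by n_bins.
    """
    total_tp = sum(hits)
    bin_size = len(hits) // n_bins
    results = []
    for b in range(1, n_bins + 1):
        cum_tp = sum(hits[: b * bin_size])       # TPs found in the first b bins
        gain = cum_tp / total_tp                 # share of all TPs captured so far
        expected = total_tp * b / n_bins         # TPs expected from random selection
        results.append((b, cum_tp, gain, cum_tp / expected))
    return results


# TP indicators for the 20 ranked examples in Table 8.4.
hits = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0] + [0] * 10

results = gain_and_lift(hits, n_bins=4)
for quartile, cum_tp, gain, lift in results:
    print(f"quartile {quartile}: cumulative TP={cum_tp}, gain={gain:.0%}, lift={lift:.3f}")
```

Passing `n_bins=10` gives the decile version of the same calculation.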
Table 8.4 Scoring Predictions and Sorting by Confidences Is the Basis for Generating Lift Curves

Actual Class  Predicted Class  Confidence of “Response”  Type?  Cumulative TP  Cumulative FP  Quartile  Gain  Lift
Response      Response         0.902                     TP     1              0              1st       67%   2.666667
Response      Response         0.896                     TP     2              0              1st
Response      Response         0.834                     TP     3              0              1st
Response      Response         0.741                     TP     4              0              1st
No response   Response         0.686                     FP     4              1              1st
Response      Response         0.616                     TP     5              1              2nd       100%  2
Response      Response         0.609                     TP     6              1              2nd
No response   Response         0.576                     FP     6              2              2nd
No response   Response         0.542                     FP     6              3              2nd
No response   Response         0.530                     FP     6              4              2nd
No response   No response      0.440                     TN     6              4              3rd       100%  1.333333
No response   No response      0.428                     TN     6              4              3rd
No response   No response      0.393                     TN     6              4              3rd
No response   No response      0.313                     TN     6              4              3rd
No response   No response      0.298                     TN     6              4              3rd
No response   No response      0.260                     TN     6              4              4th       100%  1
No response   No response      0.248                     TN     6              4              4th
No response   No response      0.247                     TN     6              4              4th
No response   No response      0.241                     TN     6              4              4th
No response   No response      0.116                     TN     6              4              4th

TP, true positive; FP, false positive; TN, true negative.

[Figure: Lift (axis from 0 to 3) and Gain (axis from 0% to 120%) plotted against Quartile (1 to 4); series: Lift, Gain]
FIGURE 8.2 Lift and gain curves.

8.4 HOW TO IMPLEMENT
A built-in dataset in RapidMiner will be used to demonstrate how all three classification performance measures (confusion matrix, ROC/AUC, and lift/gain charts) are evaluated. The process shown in Fig. 8.3 uses the Generate Direct Mailing Data operator to create a 10,000 record dataset. The objective of the modeling (Naïve Bayes used here) is to predict whether a person is likely to respond to a direct mailing campaign or not based on demographic attributes (age, lifestyle, earnings, type of car, family status, and sports affinity).

FIGURE 8.3 Process setup to demonstrate typical classification performance metrics.

Step 1: Data Preparation
Create a dataset with 10,000 examples using the Generate Direct Mailing Data operator by setting a local random seed (default = 1992) to ensure repeatability. Convert the label attribute from polynominal (nominal) to binominal using the appropriate operator as shown. This enables one to select specific binominal classification performance measures.
Split data into two partitions: an 80% partition (8000 examples) for model building and validation and a 20% partition for testing. An important point to note is that data partitioning is not an exact science and this ratio can change depending on the data.
Connect the 80% output (upper output port) from the Split Data operator to the Split Validation operator. Select a relative split with a ratio of 0.7 (70% for training) and shuffled sampling.
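To make the partition sizes concrete, the sketch below mimics the two-stage split in plain Python. It is not RapidMiner code: the helper `split`, the stand-in record ids, and the use of Python's `random.Random(1992)` are all assumptions made for illustration, and a Python seed will not reproduce RapidMiner's sampling.

```python
import random


def split(items, fraction, rng):
    """Shuffle a copy of items and cut it at the given fraction."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(1992)        # fixed seed so the split is repeatable
records = list(range(10_000))    # stand-in record ids, not the generated mailing data

# 80% for model building and validation, 20% held out for testing.
build, test = split(records, 0.8, rng)
# Within the 80% partition: 70% for training, 30% for validation.
train, validation = split(build, 0.7, rng)

print(len(train), len(validation), len(test))  # 5600 2400 2000
```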
Step 2: Modeling Operator and Parameters
Insert the naïve Bayes operator in the Training panel of the Split Validation operator and the usual Apply Model operator in the Testing panel. Add a Performance (Binomial Classification) operator. Select the following options in the performance operator: accuracy, FP, FN, TP, TN, sensitivity, specificity, and AUC.
Step 3: Evaluation
Add another Apply Model operator outside the Split Validation operator and deliver the model to its mod input port while connecting the 2000 example data partition from Step 1 to the unl port. Add a Create Lift Chart operator with these options selected: target class = response, binning type = frequency, and number of bins = 10. Note the port connections as shown in Fig. 8.3.
Step 4: Execution and Interpretation
When the above process is run, the confusion matrix and ROC curve for the validation sample should be generated (30% of the original 80% = 2400 examples), whereas a lift curve should be generated for the test sample (2000 examples). There is no reason why one cannot add another Performance (Binomial Classification) operator for the test sample or create a lift chart for the validation examples. (The reader should try this as an exercise: how will the output from the Create Lift Chart operator be delivered when it is inserted inside the Split Validation operator?)
The confusion matrix shown in Fig. 8.4 is used to calculate several common metrics using the definitions from Table 8.2. Compare them with the RapidMiner outputs to verify understanding.
TP = 629; TN = 1231; FP = 394; FN = 146

Term         Definition                     Calculation
Sensitivity  TP/(TP + FN)                   629/(629 + 146) = 81.16%
Specificity  TN/(TN + FP)                   1231/(1231 + 394) = 75.75%
Precision    TP/(TP + FP)                   629/(629 + 394) = 61.5%
Recall       TP/(TP + FN)                   629/(629 + 146) = 81.16%
Accuracy     (TP + TN)/(TP + TN + FP + FN)  (629 + 1231)/(629 + 1231 + 394 + 146) = 77.5%
Note that RapidMiner makes a distinction between the two classes while calculating precision and recall. For example, in order to calculate a class recall for “no response,” the positive class becomes “no response” and the corresponding TP is 1231 and the corresponding FN is 394. Therefore, a class recall for “no response” is 1231/(1231 + 394) = 75.75%, whereas the calculation above assumed that “response” was the positive class. Class recall is an important metric to keep in mind when dealing with highly unbalanced data. Data are considered unbalanced if the proportion of the two classes is skewed. When models are trained on unbalanced data, the resulting class recalls also tend to be skewed. For example, in a dataset where there are only 2% responses, the resulting model can have a high recall for “no responses” but a very low class recall for “responses.” This skew is not seen in the overall model accuracy and using this model on unseen data may result in severe misclassifications.
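The per-class view can be sketched by storing the confusion matrix as a nested dictionary and treating each class in turn as the positive class. The layout `confusion[predicted][actual]` and the function names are illustrative choices, not RapidMiner's API; the four counts are the ones quoted above.

```python
tp, tn, fp, fn = 629, 1231, 394, 146   # counts with "response" as the positive class

# confusion[predicted][actual], following the row/column layout of Table 8.1.
confusion = {
    "response":    {"response": tp, "no response": fp},
    "no response": {"response": fn, "no response": tn},
}


def class_recall(confusion, cls):
    """Correct predictions of cls divided by all actual members of cls."""
    actual_total = sum(row[cls] for row in confusion.values())
    return confusion[cls][cls] / actual_total


def class_precision(confusion, cls):
    """Correct predictions of cls divided by all predictions of cls."""
    return confusion[cls][cls] / sum(confusion[cls].values())


for cls in confusion:
    print(f"{cls:>11}: recall={class_recall(confusion, cls):.2%}, "
          f"precision={class_precision(confusion, cls):.2%}")
```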
The solution to this problem is to either balance the training data so that one ends up with a more or less equal proportion of classes or to insert penalties or costs on misclassifications using the Metacost operator as discussed in Chapter 5, Regression Methods. Data balancing is explained in more detail in Chapter 13, Anomaly Detection.
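One possible way to balance a training set is random undersampling of the larger class, sketched below. This is an assumption chosen for illustration rather than the specific method used elsewhere in the book; the function name `undersample` and the toy dataset with 2% responses are hypothetical.

```python
import random
from collections import Counter


def undersample(examples, label_of, rng):
    """Randomly drop examples so every class is as small as the rarest class."""
    by_class = {}
    for ex in examples:
        by_class.setdefault(label_of(ex), []).append(ex)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))   # sample without replacement
    rng.shuffle(balanced)
    return balanced


# Hypothetical training set with 2% responses: (label, record id) pairs.
examples = [("response" if i < 20 else "no response", i) for i in range(1000)]

balanced = undersample(examples, label_of=lambda ex: ex[0], rng=random.Random(0))
print(Counter(label for label, _ in balanced))
```

Undersampling discards data from the majority class, so it is only one of several balancing options; only the training partition should be balanced, not the test data.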
The AUC is shown along with the ROC curve in Fig. 8.5. As mentioned earlier, AUC values close to 1 are indicative of a good model. The ROC captures the sorted confidences of a prediction. As long as the prediction is correct for the examples, the curve takes one step up (increased TP). If the prediction is wrong, the curve takes one step to the right (increased FP). RapidMiner can show two additional AUCs called optimistic and pessimistic. The differences between the optimistic and pessimistic curves occur when there are examples with the same confidence, but the predictions are sometimes false and sometimes true. The optimistic curve shows the possibility that the correct predictions are chosen first so the curve goes steeper upwards. The pessimistic curve shows the possibility that the wrong predictions are chosen first so the curve increases more gradually.

FIGURE 8.4 Confusion matrix for validation set of direct marketing dataset.

FIGURE 8.5 ROC curve and AUC. ROC, receiver operator characteristic; AUC, area under the curve.
Finally, the lift chart outputs do not directly indicate the lift values as has been demonstrated with the simple example earlier. In Step 3 of the process, 10 bins were selected for the chart and, thus, each bin will have 200 examples (a decile). Recall that to create a lift chart all the predictions will need to be sorted by the confidence of the positive class (response), which is shown in Fig. 8.6.
The first bar in the lift chart shown in Fig. 8.7 corresponds to the first bin of 200 examples after the sorting. The bar reveals that there are 181 TPs in this bin (the table in Fig. 8.6 shows, for instance, that the very second example, Row No. 1973, is an FP). From the confusion matrix earlier, 629 TPs can be seen in this example set. A random classifier would have identified 10% of these or 62.9 TPs in the first 200 examples. Therefore, the lift for the first decile is 181/62.9 = 2.88. Similarly the lift for the first two deciles is (181 + 167)/(2 × 62.9) = 2.77 and so on. Also, the first decile contains 181/629 = 28.8% of the TPs, the first two deciles contain (181 + 167)/629 = 55.3% of the TPs, and so on. This is shown in the cumulative (percent) gains curve on the right hand y-axis of the lift chart output.
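The decile arithmetic above can be reproduced with a short loop. Only the first two bin counts (181 and 167) are quoted in the text, so the sketch stops there; the variable names are illustrative.

```python
total_tp = 629            # TPs taken from the confusion matrix
n_bins = 10               # deciles
tp_per_bin = [181, 167]   # TPs in the first two deciles; later bins are not quoted in the text

expected_per_bin = total_tp / n_bins   # 62.9 TPs per decile under random selection

decile_stats = []
cum_tp = 0
for k, tp_in_bin in enumerate(tp_per_bin, start=1):
    cum_tp += tp_in_bin
    lift = cum_tp / (k * expected_per_bin)   # cumulative lift over random selection
    gain = cum_tp / total_tp                 # cumulative share of all TPs
    decile_stats.append((k, cum_tp, lift, gain))
    print(f"first {k} decile(s): cumulative TP={cum_tp}, lift={lift:.2f}, gain={gain:.1%}")
```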
As described earlier, a good classifier will accumulate all the TPs in the first few deciles and will have extremely few FPs at the top of the heap. This will result in a gain curve that quickly rises to the 100% level within the first few deciles.
FIGURE 8.6 Table of scored responses used to build the lift chart.

FIGURE 8.7 Lift chart generated.

8.5 CONCLUSION
This chapter covered the basic performance evaluation tools that are typically used in classification methods. Firstly the basic elements of a confusion matrix were described and then the concepts that are important to understanding it, such as sensitivity, specificity, and accuracy were explored in detail. The ROC curve was then described, which has its origins in signal detection theory and has now been adopted for data science, along with the equally useful aggregate metric of AUC. Finally, two useful tools were described that have their origins in direct marketing applications: lift and gain charts. How to build these curves in general and how they can be constructed using RapidMiner was discussed. In summary, these tools are some of the most commonly used metrics for evaluating predictive models and developing skill and confidence in using these is a prerequisite to developing data science expertise.
One key to developing good predictive models is to know when to use which measures. As discussed earlier, relying on a single measure like accuracy can be misleading. For highly unbalanced datasets, rely on several measures such as class recall and precision in addition to accuracy. ROC curves are frequently used to compare several algorithms side by side. Additionally, just as there are an infinite number of triangular shapes that have the same area, AUC should not be used alone to judge a model; AUC and ROCs should be used in conjunction to rate a model’s performance. Finally, lift and gain charts are most commonly used for scoring applications where the examples in a dataset need to be rank-ordered according to their propensity to belong to a particular category.
References
Berry, M. A. (1999). Mastering data mining: The art and science of customer relationship management. New York: John Wiley and Sons.
Black, K. (2008). Business statistics for contemporary decision making. New York: John Wiley and Sons.
Green, D. S. (1966). Signal detection theory and psychophysics. New York: John Wiley and Sons.
Kohavi, R., & Provost, F. (1998). Glossary of terms. Machine Learning, 30, 271–274.
Rud, O. (2000). Data mining cookbook: Modeling data for marketing, risk and customer relationship management. New York: John Wiley and Sons.
Taylor, J. (2011). Decision management systems: A practical guide to using business rules and predictive analytics. Boston, Massachusetts: IBM Press.
References 279
- Cover
- Data Science: Concepts and Practice
- Copyright
- Dedication
- Foreword
- Preface
- Why Data Science?
- Why This Book?
- Who Can Use This Book?
- Acknowledgments
- 1 Introduction
- 1.1 AI, Machine learning, and Data Science
- 1.2 What is Data Science?
- 1.2.1 Extracting Meaningful Patterns
- 1.2.2 Building Representative Models
- 1.2.3 Combination of Statistics, Machine Learning, and Computing
- 1.2.4 Learning Algorithms
- 1.2.5 Associated Fields
- 1.3 Case for Data Science
- 1.3.1 Volume
- 1.3.2 Dimensions
- 1.3.3 Complex Questions
- 1.4 Data Science Classification
- 1.5 Data Science Algorithms
- 1.6 Roadmap for This Book
- 1.6.1 Getting Started With Data Science
- 1.6.2 Practice using RapidMiner
- 1.6.3 Core Algorithms
- References
- 2 Data Science Process
- 2.1 Prior Knowledge
- 2.1.1 Objective
- 2.1.2 Subject Area
- 2.1.3 Data
- 2.1.4 Causation Versus Correlation
- 2.2 Data Preparation
- 2.2.1 Data Exploration
- 2.2.2 Data Quality
- 2.2.3 Missing Values
- 2.2.4 Data Types and Conversion
- 2.2.5 Transformation
- 2.2.6 Outliers
- 2.2.7 Feature Selection
- 2.2.8 Data Sampling
- 2.3 Modeling
- 2.3.1 Training and Testing Datasets
- 2.3.2 Learning Algorithms
- 2.3.3 Evaluation of the Model
- 2.3.4 Ensemble Modeling
- 2.4 Application
- 2.4.1 Production Readiness
- 2.4.2 Technical Integration
- 2.4.3 Response Time
- 2.4.4 Model Refresh
- 2.4.5 Assimilation
- 2.5 Knowledge
- References
- 3 Data Exploration
- 3.1 Objectives of Data Exploration
- 3.2 Datasets
- 3.2.1 Types of Data
- Numeric or Continuous
- Categorical or Nominal
- 3.3 Descriptive Statistics
- 3.3.1 Univariate Exploration
- Measure of Central Tendency
- Measure of Spread
- 3.3.2 Multivariate Exploration
- Central Data Point
- Correlation
- 3.4 Data Visualization
- 3.4.1 Univariate Visualization
- Histogram
- Quartile
- Distribution Chart
- 3.4.2 Multivariate Visualization
- Scatterplot
- Scatter Multiple
- Scatter Matrix
- Bubble Chart
- Density Chart
- 3.4.3 Visualizing High-Dimensional Data
- Parallel Chart
- Deviation Chart
- Andrews Curves
- 3.5 Roadmap for Data Exploration
- References
- 4 Classification
- 4.1 Decision Trees
- 4.1.1 How It Works
- Step 1: Where to Split Data?
- Step 2: When to Stop Splitting Data?
- 4.1.2 How to Implement
- Implementation 1: To Play Golf or Not?
- Implementation 2: Prospect Filtering
- Step 1: Data Preparation
- Step 2: Divide Dataset Into Training and Testing Samples
- Step 3: Modeling Operator and Parameters
- Step 4: Configuring the Decision Tree Model
- Step 5: Process Execution and Interpretation
- 4.1.3 Conclusion
- 4.2 Rule Induction
- Approaches to Developing a Rule Set
- 4.2.1 How It Works
- Step 1: Class Selection
- Step 2: Rule Development
- Step 3: Learn-One-Rule
- Step 4: Next Rule
- Step 5: Development of Rule Set
- 4.2.2 How to Implement
- Step 1: Data Preparation
- Step 2: Modeling Operator and Parameters
- Step 3: Results Interpretation
- Alternative Approach: Tree-to-Rules
- 4.2.3 Conclusion
- 4.3 k-Nearest Neighbors
- 4.3.1 How It Works
- Measure of Proximity
- Distance
- Weights
- Correlation similarity
- Simple matching coefficient
- Jaccard similarity
- Cosine similarity
- 4.3.2 How to Implement
- Step 1: Data Preparation
- Step 2: Modeling Operator and Parameters
- Step 3: Execution and Interpretation
- 4.3.3 Conclusion
- 4.4 Naïve Bayesian
- 4.4.1 How It Works
- Step 1: Calculating Prior Probability P(Y)
- Step 2: Calculating Class Conditional Probability P(Xi|Y)
- Step 3: Predicting the Outcome Using Bayes’ Theorem
- Issue 1: Incomplete Training Set
- Issue 2: Continuous Attributes
- Issue 3: Attribute Independence
- 4.4.2 How to Implement
- Step 1: Data Preparation
- Step 2: Modeling Operator and Parameters
- Step 3: Evaluation
- Step 4: Execution and Interpretation
- 4.4.3 Conclusion
- 4.5 Artificial Neural Networks
- 4.5.1 How It Works
- Step 1: Determine the Topology and Activation Function
- Step 2: Initiation
- Step 3: Calculating Error
- Step 4: Weight Adjustment
- 4.5.2 How to Implement
- Step 1: Data Preparation
- Step 2: Modeling Operator and Parameters
- Step 3: Evaluation
- Step 4: Execution and Interpretation
- 4.5.3 Conclusion
- 4.6 Support Vector Machines
- Concept and Terminology
- 4.6.1 How It Works
- 4.6.2 How to Implement
- Implementation 1: Linearly Separable Dataset
- Step 1: Data Preparation
- Step 2: Modeling Operator and Parameters
- Step 3: Process Execution and Interpretation
- Example 2: Linearly Non-Separable Dataset
- Step 1: Data Preparation
- Step 2: Modeling Operator and Parameters
- Step 3: Execution and Interpretation
- Parameter Settings
- 4.6.3 Conclusion
- 4.7 Ensemble Learners
- Wisdom of the Crowd
- 4.7.1 How It Works
- Achieving the Conditions for Ensemble Modeling
- 4.7.2 How to Implement
- Ensemble by Voting
- Bootstrap Aggregating or Bagging
- Implementation
- Boosting
- AdaBoost
- Implementation
- Random Forest
- Implementation
- 4.7.3 Conclusion
- References
- 5 Regression Methods
- 5.1 Linear Regression
- 5.1.1 How It Works
- 5.1.2 How to Implement
- Step 1: Data Preparation
- Step 2: Model Building
- Step 3: Execution and Interpretation
- Step 4: Application to Unseen Test Data
- 5.1.3 Checkpoints
- 5.2 Logistic Regression
- 5.2.1 How It Works
- How Does Logistic Regression Find the Sigmoid Curve?
- A Simple but Tragic Example
- 5.2.2 How to Implement
- Step 1: Data Preparation
- Step 2: Modeling Operator and Parameters
- Step 3: Execution and Interpretation
- Step 4: Using MetaCost
- Step 5: Applying the Model to an Unseen Dataset
- 5.2.3 Summary Points
- 5.3 Conclusion
- References
- 6 Association Analysis
- 6.1 Mining Association Rules
- 6.1.1 Itemsets
- Support
- Confidence
- Lift
- Conviction
- 6.1.2 Rule Generation
- 6.2 Apriori Algorithm
- 6.2.1 How It Works
- Frequent Itemset Generation
- Rule Generation
- 6.3 Frequent Pattern-Growth Algorithm
- 6.3.1 How It Works
- Frequent Itemset Generation
- 6.3.2 How to Implement
- Step 1: Data Preparation
- Step 2: Modeling Operator and Parameters
- Step 3: Create Association Rules
- Step 4: Interpreting the Results
- 6.4 Conclusion
- References
- 7 Clustering
- Clustering to Describe the Data
- Clustering for Preprocessing
- Types of Clustering Techniques
- 7.1 k-Means Clustering
- 7.1.1 How It Works
- Step 1: Initiate Centroids
- Step 2: Assign Data Points
- Step 3: Calculate New Centroids
- Step 4: Repeat Assignment and Calculate New Centroids
- Step 5: Termination
- Special Cases
- Evaluation of Clusters
- 7.1.2 How to Implement
- Step 1: Data Preparation
- Step 2: Clustering Operator and Parameters
- Step 3: Evaluation
- Step 4: Execution and Interpretation
- 7.2 DBSCAN Clustering
- 7.2.1 How It Works
- Step 1: Defining Epsilon and MinPoints
- Step 2: Classification of Data Points
- Step 3: Clustering
- Optimizing Parameters
- Special Cases: Varying Densities
- 7.2.2 How to Implement
- Step 1: Data Preparation
- Step 2: Clustering Operator and Parameters
- Step 3: Evaluation
- Step 4: Execution and Interpretation
- 7.3 Self-Organizing Maps
- 7.3.1 How It Works
- Step 1: Topology Specification
- Step 2: Initialize Centroids
- Step 3: Assignment of Data Objects
- Step 4: Centroid Update
- Step 5: Termination
- Step 6: Mapping a New Data Object
- 7.3.2 How to Implement
- Step 1: Data Preparation
- Step 2: SOM Modeling Operator and Parameters
- Step 3: Execution and Interpretation
- Visual Model
- Location Coordinates
- Conclusion
- References
- 8 Model Evaluation
- 8.1 Confusion Matrix
- 8.2 ROC and AUC
- 8.3 Lift Curves
- 8.4 How to Implement
- Step 1: Data Preparation
- Step 2: Modeling Operator and Parameters
- Step 3: Evaluation
- Step 4: Execution and Interpretation
- 8.5 Conclusion
- References
- 9 Text Mining
- 9.1 How It Works
- 9.1.1 Term Frequency–Inverse Document Frequency
- 9.1.2 Terminology
- 9.2 How to Implement
- 9.2.1 Implementation 1: Keyword Clustering
- Step 1: Gather Unstructured Data
- Step 2: Data Preparation
- Step 3: Apply Clustering
- 9.2.2 Implementation 2: Predicting the Gender of Blog Authors
- Step 1: Gather Unstructured Data
- Step 2: Data Preparation
- Step 3.1: Identify Key Features
- Step 3.2: Build Models
- Step 4.1: Prepare Test Data for Model Application
- Step 4.2: Applying the Trained Models to Testing Data
- Bias in Machine Learning
- 9.3 Conclusion
- References
- 10 Deep Learning
- 10.1 The AI Winter
- AI Winter: 1970s
- Mid-Winter Thaw of the 1980s
- The Spring and Summer of Artificial Intelligence: 2006—Today
- 10.2 How It Works
- 10.2.1 Regression Models As Neural Networks
- 10.2.2 Gradient Descent
- 10.2.3 Need for Backpropagation
- 10.2.4 Classifying More Than 2 Classes: Softmax
- 10.2.5 Convolutional Neural Networks
- 10.2.6 Dense Layer
- 10.2.7 Dropout Layer
- 10.2.8 Recurrent Neural Networks
- 10.2.9 Autoencoders
- 10.2.10 Related AI Models
- 10.3 How to Implement
- Handwritten Image Recognition
- Step 1: Dataset Preparation
- Step 2: Modeling Using the Keras Model
- Step 3: Applying the Keras Model
- Step 4: Results
- 10.4 Conclusion
- References
- 11 Recommendation Engines
- Why Do We Need Recommendation Engines?
- Applications of Recommendation Engines
- 11.1 Recommendation Engine Concepts
- Building up the Ratings Matrix
- Step 1: Assemble Known Ratings
- Step 2: Rating Prediction
- Step 3: Evaluation
- The Balance
- 11.1.1 Types of Recommendation Engines
- 11.2 Collaborative Filtering
- 11.2.1 Neighborhood-Based Methods
- User-Based Collaborative Filtering
- Step 1: Identifying Similar Users
- Step 2: Deducing Rating From Neighborhood Users
- Item-Based Collaborative Filtering
- User-Based or Item-Based Collaborative Filtering?
- Neighborhood-Based Collaborative Filtering - How to Implement
- Dataset
- Implementation Steps
- Conclusion
- 11.2.2 Matrix Factorization
- Matrix Factorization - How to Implement
- Implementation Steps
- 11.3 Content-Based Filtering
- Building an Item Profile
- 11.3.1 User Profile Computation
- Content-Based Filtering - How to Implement
- Dataset
- Implementation Steps
- 11.3.2 Supervised Learning Models
- Supervised Learning Models - How to Implement
- Dataset
- Implementation Steps
- 11.4 Hybrid Recommenders
- 11.5 Conclusion
- Summary of the Types of Recommendation Engines
- References
- 12 Time Series Forecasting
- Taxonomy of Time Series Forecasting
- 12.1 Time Series Decomposition
- 12.1.1 Classical Decomposition
- 12.1.2 How to Implement
- Forecasting Using Decomposed Data
- 12.2 Smoothing Based Methods
- 12.2.1 Simple Forecasting Methods
- Naïve Method
- Seasonal Naive Method
- Average Method
- Moving Average Smoothing
- Weighted Moving Average Smoothing
- 12.2.2 Exponential Smoothing
- Holt’s Two-Parameter Exponential Smoothing
- Holt-Winters’ Three-Parameter Exponential Smoothing
- 12.2.3 How to Implement
- R Script for Holt-Winters’ Forecasting
- 12.3 Regression Based Methods
- 12.3.1 Regression
- 12.3.2 Regression With Seasonality
- How to Implement
- 12.3.3 Autoregressive Integrated Moving Average
- Autocorrelation
- Autoregressive Models
- Stationary Data
- Differencing
- Moving Average of Error
- Autoregressive Integrated Moving Average
- How to Implement
- 12.3.4 Seasonal ARIMA
- How to Implement
- 12.4 Machine Learning Methods
- 12.4.1 Windowing
- Model Training
- How to Implement
- Step 1: Set Up Windowing
- Step 2: Train the Model
- Step 3: Generate the Forecast in a Loop
- 12.4.2 Neural Network Autoregressive
- How to Implement
- 12.5 Performance Evaluation
- 12.5.1 Validation Dataset
- Mean Absolute Error
- Root Mean Squared Error
- Mean Absolute Percentage Error
- Mean Absolute Scaled Error
- 12.5.2 Sliding Window Validation
- 12.6 Conclusion
- 12.6.1 Forecasting Best Practices
- References
- 13 Anomaly Detection
- 13.1 Concepts
- 13.1.1 Causes of Outliers
- 13.1.2 Anomaly Detection Techniques
- Outlier Detection Using Statistical Methods
- Outlier Detection Using Data Science
- 13.2 Distance-Based Outlier Detection
- 13.2.1 How It Works
- 13.2.2 How to Implement
- Step 1: Data Preparation
- Step 2: Detect Outlier Operator
- Step 3: Execution and Interpretation
- 13.3 Density-Based Outlier Detection
- 13.3.1 How It Works
- 13.3.2 How to Implement
- Step 1: Data Preparation
- Step 2: Detect Outlier Operator
- Step 3: Execution and Interpretation
- 13.4 Local Outlier Factor
- 13.4.1 How It Works
- 13.4.2 How to Implement
- Step 1: Data Preparation
- Step 2: Detect Outlier Operator
- Step 3: Results Interpretation
- 13.5 Conclusion
- References
- 14 Feature Selection
- 14.1 Classifying Feature Selection Methods
- 14.2 Principal Component Analysis
- 14.2.1 How It Works
- 14.2.2 How to Implement
- Step 1: Data Preparation
- Step 2: PCA Operator
- Step 3: Execution and Interpretation
- 14.3 Information Theory-Based Filtering
- 14.4 Chi-Square-Based Filtering
- 14.5 Wrapper-Type Feature Selection
- 14.5.1 Backward Elimination
- 14.6 Conclusion
- References
- 15 Getting Started with RapidMiner
- 15.1 User Interface and Terminology
- Terminology
- 15.2 Data Importing and Exporting Tools
- 15.3 Data Visualization Tools
- Univariate Plots
- Bivariate Plots
- Multivariate Plots
- 15.4 Data Transformation Tools
- 15.5 Sampling and Missing Value Tools
- 15.6 Optimization Tools
- 15.7 Integration with R
- 15.8 Conclusion
- References
- Comparison of Data Science Algorithms
- About the Authors
- Vijay Kotu
- Bala Deshpande, PhD
- Index
- Praise
- Back Cover