Computer Science
02/05/2020 Introduction to Data Mining, 2nd Edition 1
Data Mining
Model Overfitting
Introduction to Data Mining, 2nd Edition by
Tan, Steinbach, Karpatne, Kumar
02/05/2020 Introduction to Data Mining, 2nd Edition 2
Classification Errors
� Training errors (apparent errors) – Errors committed on the training set
� Test errors – Errors committed on the test set
� Generalization errors – Expected error of a model over random
selection of records from same distribution
1
2
02/05/2020 Introduction to Data Mining, 2nd Edition 3
Example Data Set
Two class problem:
+ : 5400 instances
• 5000 instances generated from a Gaussian centered at (10,10)
• 400 noisy instances added
o : 5400 instances • Generated from a uniform distribution
10 % of the data used for training and 90% of the data used for testing
02/05/2020 Introduction to Data Mining, 2nd Edition 4
Increasing number of nodes in Decision Trees
3
4
02/05/2020 Introduction to Data Mining, 2nd Edition 5
Decision Tree with 4 nodes
Decision Tree
Decision boundaries on Training data
02/05/2020 Introduction to Data Mining, 2nd Edition 6
Decision Tree with 50 nodes
Decision TreeDecision Tree
Decision boundaries on Training data
5
6
02/05/2020 Introduction to Data Mining, 2nd Edition 7
Which tree is better?
Decision Tree with 4 nodes
Decision Tree with 50 nodes Which tree is better ?
02/05/2020 Introduction to Data Mining, 2nd Edition 8
Model Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is large
•As the model becomes more and more complex, test errors can start increasing even though training error may be decreasing
7
8
02/05/2020 Introduction to Data Mining, 2nd Edition 9
Model Overfitting
Using twice the number of data instances
• Increasing the size of training data reduces the difference between training and testing errors at a given size of model
02/05/2020 Introduction to Data Mining, 2nd Edition 10
Model Overfitting
Using twice the number of data instances
• Increasing the size of training data reduces the difference between training and testing errors at a given size of model
Decision Tree with 50 nodes Decision Tree with 50 nodes
9
10
02/05/2020 Introduction to Data Mining, 2nd Edition 11
Reasons for Model Overfitting
� Limited Training Size
� High Model Complexity – Multiple Comparison Procedure
02/05/2020 Introduction to Data Mining, 2nd Edition 12
Effect of Multiple Comparison Procedure
� Consider the task of predicting whether stock market will rise/fall in the next 10 trading days
� Random guessing: P(correct) = 0.5
� Make 10 random guesses in a row:
Day 1 Up Day 2 Down Day 3 Down Day 4 Up Day 5 Down Day 6 Down Day 7 Up Day 8 Up Day 9 Up Day 10 Down
0547.0 2
10 10
9 10
8 10
)8(# 10
correctP
11
12
02/05/2020 Introduction to Data Mining, 2nd Edition 13
Effect of Multiple Comparison Procedure
� Approach: – Get 50 analysts – Each analyst makes 10 random guesses – Choose the analyst that makes the most
number of correct predictions
� Probability that at least one analyst makes at least 8 correct predictions
9399.0)0547.01(1)8(# 50 correctP
02/05/2020 Introduction to Data Mining, 2nd Edition 14
Effect of Multiple Comparison Procedure
� Many algorithms employ the following greedy strategy: – Initial model: M – Alternative model: M’ = M ,
where is a component to be added to the model (e.g., a test condition of a decision tree)
– Keep M’ if improvement, (M,M’) >
� Often times, is chosen from a set of alternative components, = {1, 2, …, k}
� If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
13
14
02/05/2020 Introduction to Data Mining, 2nd Edition 15
Effect of Multiple Comparison - Example
Use additional 100 noisy variables generated from a uniform distribution along with X and Y as attributes.
Use 30% of the data for training and 70% of the data for testing
Using only X and Y as attributes
02/05/2020 Introduction to Data Mining, 2nd Edition 16
Notes on Overfitting
� Overfitting results in decision trees that are more complex than necessary
� Training error does not provide a good estimate of how well the tree will perform on previously unseen records
� Need ways for estimating generalization errors
15
16
02/05/2020 Introduction to Data Mining, 2nd Edition 17
Model Selection
� Performed during model building � Purpose is to ensure that model is not overly
complex (to avoid overfitting)
� Need to estimate generalization error – Using Validation Set
– Incorporating Model Complexity
– Estimating Statistical Bounds
02/05/2020 Introduction to Data Mining, 2nd Edition 18
Model Selection:
Using Validation Set � Divide training data into two parts:
– Training set: use for model building
– Validation set: use for estimating generalization error Note: validation set is not the same as test set
� Drawback: – Less data available for training
17
18
02/05/2020 Introduction to Data Mining, 2nd Edition 19
Model Selection:
Incorporating Model Complexity � Rationale: Occam’s Razor
– Given two models of similar generalization errors, one should prefer the simpler model over the more complex model
– A complex model has a greater chance of being fitted accidentally
– Therefore, one should include model complexity when evaluating a model
Gen. Error(Model) = Train. Error(Model, Train. Data) + x Complexity(Model)
02/05/2020 Introduction to Data Mining, 2nd Edition 20
Estimating the Complexity of Decision Trees
� Pessimistic Error Estimate of decision tree T with k leaf nodes:
– err(T): error rate on all training records – : trade-off hyper-parameter (similar to )
Relative cost of adding a leaf node
– k: number of leaf nodes – Ntrain: total number of training records
19
20
02/05/2020 Introduction to Data Mining, 2nd Edition 21
Estimating the Complexity of Decision Trees: Example
e(TL) = 4/24
e(TR) = 6/24
= 1
egen(TL) = 4/24 + 1*7/24 = 11/24 = 0.458
egen(TR) = 6/24 + 1*4/24 = 10/24 = 0.417
02/05/2020 Introduction to Data Mining, 2nd Edition 22
Estimating the Complexity of Decision Trees
� Resubstitution Estimate: – Using training error as an optimistic estimate of
generalization error – Referred to as optimistic error estimate
e(TL) = 4/24
e(TR) = 6/24
21
22
02/05/2020 Introduction to Data Mining, 2nd Edition 23
Minimum Description Length (MDL)
� Cost(Model,Data) = Cost(Data|Model) + x Cost(Model) – Cost is the number of bits needed for encoding. – Search for the least costly model.
� Cost(Data|Model) encodes the misclassification errors. � Cost(Model) uses node encoding (number of children)
plus splitting condition encoding.
A B
A?
B?
C?
10
0
1
Yes No
B1 B2
C1 C2
X y X1 1 X2 0 X3 0 X4 1 … … Xn 1
X y X1 ? X2 ? X3 ? X4 ? … … Xn ?
02/05/2020 Introduction to Data Mining, 2nd Edition 24
Estimating Statistical Bounds
Before splitting: e = 2/7, e’(7, 2/7, 0.25) = 0.503
e’(T) = 7 0.503 = 3.521
After splitting:
e(TL) = 1/4, e’(4, 1/4, 0.25) = 0.537
e(TR) = 1/3, e’(3, 1/3, 0.25) = 0.650
e’(T) = 4 0.537 + 3 0.650 = 4.098
N z
N z
N ee
z N
z e
eNe 2 2/
2
2 2/
2/
2 2/
1
4 )1(
2),,('
Therefore, do not split
23
24
02/05/2020 Introduction to Data Mining, 2nd Edition 25
Model Selection for Decision Trees
� Pre-Pruning (Early Stopping Rule) – Stop the algorithm before it becomes a fully-grown tree – Typical stopping conditions for a node:
Stop if all instances belong to the same class Stop if all the attribute values are the same
– More restrictive conditions: Stop if number of instances is less than some user-specified threshold Stop if class distribution of instances are independent of the available features (e.g., using 2 test) Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain). Stop if estimated generalization error falls below certain threshold
02/05/2020 Introduction to Data Mining, 2nd Edition 26
Model Selection for Decision Trees
� Post-pruning – Grow decision tree to its entirety – Subtree replacement
Trim the nodes of the decision tree in a bottom-up fashion If generalization error improves after trimming, replace sub-tree by a leaf node Class label of leaf node is determined from majority class of instances in the sub-tree
– Subtree raising Replace subtree with most frequently used branch
25
26
02/05/2020 Introduction to Data Mining, 2nd Edition 27
Example of Post-Pruning
A?
A1
A2 A3
A4
Class = Yes 20
Class = No 10 Error = 10/30
Training Error (Before splitting) = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30
Training Error (After splitting) = 9/30
Pessimistic error (After splitting)
= (9 + 4 0.5)/30 = 11/30
PRUNE!
Class = Yes 8 Class = No 4
Class = Yes 3 Class = No 4
Class = Yes 4 Class = No 1
Class = Yes 5 Class = No 1
02/05/2020 Introduction to Data Mining, 2nd Edition 28
Examples of Post-pruning
27
28
02/05/2020 Introduction to Data Mining, 2nd Edition 29
Model Evaluation
� Purpose: – To estimate performance of classifier on previously
unseen data (test set)
� Holdout – Reserve k% for training and (100-k)% for testing – Random subsampling: repeated holdout
� Cross validation – Partition data into k disjoint subsets – k-fold: train on k-1 partitions, test on the remaining one – Leave-one-out: k=n
02/05/2020 Introduction to Data Mining, 2nd Edition 30
Cross-validation Example
� 3-fold cross-validation
29
30
02/05/2020 Introduction to Data Mining, 2nd Edition 31
Variations on Cross-validation
� Repeated cross-validation – Perform cross-validation a number of times – Gives an estimate of the variance of the
generalization error � Stratified cross-validation
– Guarantee the same percentage of class labels in training and test
– Important when classes are imbalanced and the sample is small
� Use nested cross-validation approach for model selection and evaluation
31