Decision Trees
Tree Induction
- Greedy strategy.
- Split the records based on an attribute test that optimizes certain criterion.
- Issues
- Determine how to split the records
- How to specify the attribute test condition?
- How to determine the best split?
- Determine when to stop splitting
Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)
Practical Issues of Classification
- Underfitting and Overfitting
- Missing Values
- Costs of Classification
Underfitting and Overfitting
Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise
Decision boundary is distorted by noise point
Overfitting due to Noise
* Bats and Whales are misclassified; non-mammals instead of mammals.
Overfitting due to Noise
Decision boundary is distorted by noise point
Both humans and dolphins were misclassified as n0n-mammals b/c Body Temp, Gives_Birth and Four-legged values are identical to mislabeled records in training set.
Spiny anteaters represent an exceptional case (every warm-blooded with no gives_birth is non-mammal in TR_Set
Decision tree perfectly fits training data (training error=0)
But error rate on test data is 30%.
Overfitting due to Noise
Estimating Generalization Errors
- Re-substitution errors: error on training ( e(t) )
- Generalization errors: error on testing ( e’(t))
- Methods for estimating generalization errors:
- Optimistic approach: e’(t) = e(t)
- Pessimistic approach:
- For each leaf node: e’(t) = (e(t)+0.5)
- Total errors: e’(T) = e(T) + N 0.5 (N: number of leaf nodes)
- For a tree with 30 leaf nodes and 10 errors on training
(out of 1000 instances):
Training error = 10/1000 = 1%
Generalization error = (10 + 300.5)/1000 = 2.5%
- Reduced error pruning (REP):
- uses validation data set to estimate generalization
error
Occam’s Razor
- Given two models of similar generalization errors, one should prefer the simpler model over the more complex model
- For complex models, there is a greater chance that it was fitted accidentally by errors in data
- Therefore, one should include model complexity when evaluating a model
How to Address Overfitting
- Pre-Pruning (Early Stopping Rule)
- Stop the algorithm before it becomes a fully-grown tree
- Typical stopping conditions for a node:
- Stop if all instances belong to the same class
- Stop if all the attribute values are the same
- More restrictive conditions:
- Stop if number of instances is less than some user-specified threshold
- Stop if class distribution of instances are independent of the available features (e.g., using 2 test)
- Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
How to Address Overfitting…
- Post-pruning
- Grow decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up fashion
- If generalization error improves after trimming, replace sub-tree by a leaf node.
- Class label of leaf node is determined from majority class of instances in the sub-tree
Example of Post-Pruning
Training Error (Before splitting) = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30
Training Error (After splitting) = 7/30
Pessimistic error (After splitting)
= (7 + 4 0.5)/30 = 9/30
PRUNE OR DO NOT PRUNE
| Class = Yes | 20 |
| Class = No | 10 |
| Error = ? |
| Class = Yes | 8 |
| Class = No | 4 |
| Class = Yes | 2 |
| Class = No | 5 |
| Class = Yes | 6 |
| Class = No | 1 |
| Class = Yes | 4 |
| Class = No | 0 |
Handling Missing Attribute Values
- Missing values affect decision tree construction in three different ways:
- Affects how impurity measures are computed
- Affects how to distribute instance with missing value to child nodes
- Affects how a test instance with missing value is classified
Model Evaluation
- Metrics for Performance Evaluation
- How to evaluate the performance of a model?
- Methods for Performance Evaluation
- How to obtain reliable estimates?
- Methods for Model Comparison
- How to compare the relative performance among competing models?
Model Evaluation
- Metrics for Performance Evaluation
- How to evaluate the performance of a model?
- Methods for Performance Evaluation
- How to obtain reliable estimates?
- Methods for Model Comparison
- How to compare the relative performance among competing models?
Metrics for Performance Evaluation
- Focus on the predictive capability of a model
- Rather than how fast it takes to classify or build models, scalability, etc.
- Confusion Matrix:
a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
| PREDICTED CLASS | ||
| ACTUAL CLASS | Class=Yes | Class=No |
| Class=Yes | a | b |
| Class=No | c | d |
Metrics for Performance Evaluation…
- Most widely-used metric:
| PREDICTED CLASS | ||
| ACTUAL CLASS | Class=Yes | Class=No |
| Class=Yes | a (TP) | b (FN) |
| Class=No | c (FP) | d (TN) |
Limitation of Accuracy
- Consider a 2-class problem
- Number of Class 0 examples = 9990
- Number of Class 1 examples = 10
- If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %
- Accuracy is misleading because model does not detect any class 1 example
Cost Matrix
C(i|j): Cost of misclassifying class j example as class i
| PREDICTED CLASS | |||
| ACTUAL CLASS | C(i|j) | Class=Yes | Class=No |
| Class=Yes | C(Yes|Yes) | C(No|Yes) | |
| Class=No | C(Yes|No) | C(No|No) |
Computing Cost of Classification
Accuracy = 80%
Cost = 3910
Accuracy = 90%
Cost = 4255
| Cost Matrix | PREDICTED CLASS | ||
| ACTUAL CLASS | C(i|j) | + | - |
| + | -1 | 100 | |
| - | 1 | 0 |
| Model M1 | PREDICTED CLASS | |
| ACTUAL CLASS | + | - |
| + | 150 | 40 |
| - | 60 | 250 |
| Model M2 | PREDICTED CLASS | |
| ACTUAL CLASS | + | - |
| + | 250 | 45 |
| - | 5 | 200 |
Cost vs Accuracy
| Count | PREDICTED CLASS | |
| ACTUAL CLASS | Class=Yes | Class=No |
| Class=Yes | a | b |
| Class=No | c | d |
| Cost | PREDICTED CLASS | |
| ACTUAL CLASS | Class=Yes | Class=No |
| Class=Yes | p | q |
| Class=No | q | p |
N = a + b + c + d
Accuracy = (a + d)/N
Cost = p (a + d) + q (b + c)
= p (a + d) + q (N – a – d)
= q N – (q – p)(a + d)
= N [q – (q-p) Accuracy]
Accuracy is proportional to cost if
1. C(Yes|No)=C(No|Yes) = q
2. C(Yes|Yes)=C(No|No) = p
Model Evaluation
- Metrics for Performance Evaluation
- How to evaluate the performance of a model?
- Methods for Performance Evaluation
- How to obtain reliable estimates?
- Methods for Model Comparison
- How to compare the relative performance among competing models?
Methods for Performance Evaluation
- How to obtain a reliable estimate of performance?
- Performance of a model may depend on other factors besides the learning algorithm:
- Class distribution
- Cost of misclassification
- Size of training and test sets
Methods of Estimation
- Holdout
- Reserve 2/3 for training and 1/3 for testing
- Random subsampling
- Repeated holdout
- Cross validation
- Partition data into k disjoint subsets
- k-fold: train on k-1 partitions, test on the remaining one
- Leave-one-out: k=n
- Stratified sampling
- oversampling vs undersampling
- Bootstrap
- Sampling with replacement
A?
A1
A2
A3
A4
FN
FP
TN
TP
TN
TP
d
c
b
a
d
a
+
+
+
+
=
+
+
+
+
=
Accuracy