DAta mining week 7 assignment
Dr. Oner Celepcikay
ITS 632
ITS 632
Machine Learning
Week 5
Classification Part II
Tree Induction
● Greedy strategy. – Split the records based on an attribute test
that optimizes certain criterion.
● Issues – Determine how to split the records
uHow to specify the attribute test condition? uHow to determine the best split?
– Determine when to stop splitting
Stopping Criteria for Tree Induction
● Stop expanding a node when all the records belong to the same class
● Stop expanding a node when all the records have similar attribute values
● Early termination (to be discussed later)
Practical Issues of Classification
● Underfitting and Overfitting
● Missing Values
● Costs of Classification
Underfitting and Overfitting
Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise
Decision boundary is distorted by noise point
Overfitting due to Noise
* Bats and Whales are misclassified; non-mammals instead of mammals.
Overfitting due to Noise
Decision boundary is distorted by noise point
• Both humans and dolphins were misclassified as n0n- mammals b/c Body Temp, Gives_Birth and Four-legged values are identical to mislabeled records in training set.
• Spiny anteaters represent an exceptional case (every warm- blooded with no gives_birth is non-mammal in TR_Set
• Decision tree perfectly fits training data (training error=0)
• But error rate on test data is 30%.
Overfitting due to Noise
Estimating Generalization Errors
● Re-substitution errors: error on training (S e(t) ) ● Generalization errors: error on testing (S e�(t)) ● Methods for estimating generalization errors:
– Optimistic approach: e�(t) = e(t) – Pessimistic approach:
u For each leaf node: e�(t) = (e(t)+0.5) u Total errors: e�(T) = e(T) + N ´ 0.5 (N: number of leaf nodes) u For a tree with 30 leaf nodes and 10 errors on training
(out of 1000 instances): Training error = 10/1000 = 1% Generalization error = (10 + 30´0.5)/1000 = 2.5%
– Reduced error pruning (REP): u uses validation data set to estimate generalization
error
Occam�s Razor
● Given two models of similar generalization errors, one should prefer the simpler model over the more complex model
● For complex models, there is a greater chance that it was fitted accidentally by errors in data
● Therefore, one should include model complexity when evaluating a model
How to Address Overfitting
● Pre-Pruning (Early Stopping Rule) – Stop the algorithm before it becomes a fully-grown tree – Typical stopping conditions for a node:
u Stop if all instances belong to the same class u Stop if all the attribute values are the same
– More restrictive conditions: u Stop if number of instances is less than some user-specified threshold u Stop if class distribution of instances are independent of the available features (e.g., using c 2 test) u Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
How to Address Overfitting…
● Post-pruning – Grow decision tree to its entirety – Trim the nodes of the decision tree in a
bottom-up fashion – If generalization error improves after trimming,
replace sub-tree by a leaf node. – Class label of leaf node is determined from
majority class of instances in the sub-tree
Example of Post-Pruning
A?
A1
A2 A3
A4
Class = Yes 20
Class = No 10 Error = ?
Training Error (Before splitting) = 10/30 Pessimistic error = (10 + 0.5)/30 = 10.5/30 Training Error (After splitting) = 7/30
Pessimistic error (After splitting) = (7 + 4 ´ 0.5)/30 = 9/30
PRUNE OR DO NOT PRUNE
Class = Yes 8 Class = No 4
Class = Yes 2 Class = No 5
Class = Yes 6 Class = No 1
Class = Yes 4 Class = No 0
Handling Missing Attribute Values
● Missing values affect decision tree construction in three different ways: – Affects how impurity measures are computed – Affects how to distribute instance with missing
value to child nodes – Affects how a test instance with missing value
is classified
Model Evaluation
● Metrics for Performance Evaluation – How to evaluate the performance of a model?
● Methods for Performance Evaluation – How to obtain reliable estimates?
● Methods for Model Comparison – How to compare the relative performance
among competing models?
Model Evaluation
● Metrics for Performance Evaluation – How to evaluate the performance of a model?
● Methods for Performance Evaluation – How to obtain reliable estimates?
● Methods for Model Comparison – How to compare the relative performance
among competing models?
Metrics for Performance Evaluation
● Focus on the predictive capability of a model – Rather than how fast it takes to classify or
build models, scalability, etc. ● Confusion Matrix:
PREDICTED CLASS
ACTUAL CLASS
Class=Yes Class=No
Class=Yes a b
Class=No c d
a: TP (true positive) b: FN (false negative) c: FP (false positive) d: TN (true negative)
Metrics for Performance Evaluation…
● Most widely-used metric:
PREDICTED CLASS
ACTUAL CLASS
Class=Yes Class=No
Class=Yes a (TP)
b (FN)
Class=No c (FP)
d (TN)
FNFPTNTP TNTP
dcba da
+++ +
= +++
+ =Accuracy
Limitation of Accuracy
● Consider a 2-class problem – Number of Class 0 examples = 9990 – Number of Class 1 examples = 10
● If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % – Accuracy is misleading because model does
not detect any class 1 example
Cost Matrix
PREDICTED CLASS
ACTUAL CLASS
C(i|j) Class=Yes Class=No
Class=Yes C(Yes|Yes) C(No|Yes)
Class=No C(Yes|No) C(No|No)
C(i|j): Cost of misclassifying class j example as class i
Computing Cost of Classification
Cost Matrix
PREDICTED CLASS
ACTUAL CLASS
C(i|j) + - + -1 100 - 1 0
Model M1 PREDICTED CLASS
ACTUAL CLASS
+ - + 150 40 - 60 250
Model M2 PREDICTED CLASS
ACTUAL CLASS
+ - + 250 45 - 5 200
Accuracy = 80% Cost = 3910
Accuracy = 90% Cost = 4255
Cost vs Accuracy
Count PREDICTED CLASS
ACTUAL CLASS
Class=Yes Class=No
Class=Yes a b
Class=No c d
Cost PREDICTED CLASS
ACTUAL CLASS
Class=Yes Class=No
Class=Yes p q
Class=No q p
N = a + b + c + d
Accuracy = (a + d)/N
Cost = p (a + d) + q (b + c)
= p (a + d) + q (N – a – d)
= q N – (q – p)(a + d)
= N [q – (q-p) ´ Accuracy]
Accuracy is proportional to cost if 1. C(Yes|No)=C(No|Yes) = q 2. C(Yes|Yes)=C(No|No) = p
Model Evaluation
● Metrics for Performance Evaluation – How to evaluate the performance of a model?
● Methods for Performance Evaluation – How to obtain reliable estimates?
● Methods for Model Comparison – How to compare the relative performance
among competing models?
Methods for Performance Evaluation
● How to obtain a reliable estimate of performance?
● Performance of a model may depend on other factors besides the learning algorithm: – Class distribution – Cost of misclassification – Size of training and test sets
Methods of Estimation ● Holdout
– Reserve 2/3 for training and 1/3 for testing ● Random subsampling
– Repeated holdout ● Cross validation
– Partition data into k disjoint subsets – k-fold: train on k-1 partitions, test on the remaining one – Leave-one-out: k=n
● Stratified sampling – oversampling vs undersampling
● Bootstrap – Sampling with replacement