Decision Trees

profileShrav
PPT.ppt

Tree Induction

  • Greedy strategy.
  • Split the records based on an attribute test that optimizes certain criterion.
  • Issues
  • Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
  • Determine when to stop splitting

Stopping Criteria for Tree Induction

  • Stop expanding a node when all the records belong to the same class
  • Stop expanding a node when all the records have similar attribute values
  • Early termination (to be discussed later)

Practical Issues of Classification

  • Underfitting and Overfitting
  • Missing Values
  • Costs of Classification

Underfitting and Overfitting

Overfitting

Underfitting: when model is too simple, both training and test errors are large

Overfitting due to Noise

Decision boundary is distorted by noise point

Overfitting due to Noise

* Bats and Whales are misclassified; non-mammals instead of mammals.

Overfitting due to Noise

Decision boundary is distorted by noise point

Both humans and dolphins were misclassified as n0n-mammals b/c Body Temp, Gives_Birth and Four-legged values are identical to mislabeled records in training set.

Spiny anteaters represent an exceptional case (every warm-blooded with no gives_birth is non-mammal in TR_Set

Decision tree perfectly fits training data (training error=0)

But error rate on test data is 30%.

Overfitting due to Noise

Estimating Generalization Errors

  • Re-substitution errors: error on training ( e(t) )
  • Generalization errors: error on testing ( e’(t))

  • Methods for estimating generalization errors:
  • Optimistic approach: e’(t) = e(t)
  • Pessimistic approach:
  • For each leaf node: e’(t) = (e(t)+0.5)
  • Total errors: e’(T) = e(T) + N  0.5 (N: number of leaf nodes)
  • For a tree with 30 leaf nodes and 10 errors on training
    (out of 1000 instances):
    Training error = 10/1000 = 1%

Generalization error = (10 + 300.5)/1000 = 2.5%

  • Reduced error pruning (REP):
  • uses validation data set to estimate generalization
    error

Occam’s Razor

  • Given two models of similar generalization errors, one should prefer the simpler model over the more complex model
  • For complex models, there is a greater chance that it was fitted accidentally by errors in data
  • Therefore, one should include model complexity when evaluating a model

How to Address Overfitting

  • Pre-Pruning (Early Stopping Rule)
  • Stop the algorithm before it becomes a fully-grown tree
  • Typical stopping conditions for a node:
  • Stop if all instances belong to the same class
  • Stop if all the attribute values are the same
  • More restrictive conditions:
  • Stop if number of instances is less than some user-specified threshold
  • Stop if class distribution of instances are independent of the available features (e.g., using  2 test)
  • Stop if expanding the current node does not improve impurity
    measures (e.g., Gini or information gain).

How to Address Overfitting…

  • Post-pruning
  • Grow decision tree to its entirety
  • Trim the nodes of the decision tree in a bottom-up fashion
  • If generalization error improves after trimming, replace sub-tree by a leaf node.
  • Class label of leaf node is determined from majority class of instances in the sub-tree

Example of Post-Pruning

Training Error (Before splitting) = 10/30

Pessimistic error = (10 + 0.5)/30 = 10.5/30

Training Error (After splitting) = 7/30

Pessimistic error (After splitting)

= (7 + 4  0.5)/30 = 9/30

PRUNE OR DO NOT PRUNE

Class = Yes 20
Class = No 10
Error = ?
Class = Yes 8
Class = No 4
Class = Yes 2
Class = No 5
Class = Yes 6
Class = No 1
Class = Yes 4
Class = No 0

Handling Missing Attribute Values

  • Missing values affect decision tree construction in three different ways:
  • Affects how impurity measures are computed
  • Affects how to distribute instance with missing value to child nodes
  • Affects how a test instance with missing value is classified

Model Evaluation

  • Metrics for Performance Evaluation
  • How to evaluate the performance of a model?

  • Methods for Performance Evaluation
  • How to obtain reliable estimates?
  • Methods for Model Comparison
  • How to compare the relative performance among competing models?

Model Evaluation

  • Metrics for Performance Evaluation
  • How to evaluate the performance of a model?

  • Methods for Performance Evaluation
  • How to obtain reliable estimates?
  • Methods for Model Comparison
  • How to compare the relative performance among competing models?

Metrics for Performance Evaluation

  • Focus on the predictive capability of a model
  • Rather than how fast it takes to classify or build models, scalability, etc.
  • Confusion Matrix:

a: TP (true positive)

b: FN (false negative)

c: FP (false positive)

d: TN (true negative)

PREDICTED CLASS
ACTUAL CLASS Class=Yes Class=No
Class=Yes a b
Class=No c d

Metrics for Performance Evaluation…

  • Most widely-used metric:
PREDICTED CLASS
ACTUAL CLASS Class=Yes Class=No
Class=Yes a (TP) b (FN)
Class=No c (FP) d (TN)

Limitation of Accuracy

  • Consider a 2-class problem
  • Number of Class 0 examples = 9990
  • Number of Class 1 examples = 10
  • If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %
  • Accuracy is misleading because model does not detect any class 1 example

Cost Matrix

C(i|j): Cost of misclassifying class j example as class i

PREDICTED CLASS
ACTUAL CLASS C(i|j) Class=Yes Class=No
Class=Yes C(Yes|Yes) C(No|Yes)
Class=No C(Yes|No) C(No|No)

Computing Cost of Classification

Accuracy = 80%

Cost = 3910

Accuracy = 90%

Cost = 4255

Cost Matrix PREDICTED CLASS
ACTUAL CLASS C(i|j) + -
+ -1 100
- 1 0
Model M1 PREDICTED CLASS
ACTUAL CLASS + -
+ 150 40
- 60 250
Model M2 PREDICTED CLASS
ACTUAL CLASS + -
+ 250 45
- 5 200

Cost vs Accuracy

Count PREDICTED CLASS
ACTUAL CLASS Class=Yes Class=No
Class=Yes a b
Class=No c d
Cost PREDICTED CLASS
ACTUAL CLASS Class=Yes Class=No
Class=Yes p q
Class=No q p

N = a + b + c + d

Accuracy = (a + d)/N

Cost = p (a + d) + q (b + c)

= p (a + d) + q (N – a – d)

= q N – (q – p)(a + d)

= N [q – (q-p)  Accuracy]

Accuracy is proportional to cost if
1. C(Yes|No)=C(No|Yes) = q
2. C(Yes|Yes)=C(No|No) = p

Model Evaluation

  • Metrics for Performance Evaluation
  • How to evaluate the performance of a model?

  • Methods for Performance Evaluation
  • How to obtain reliable estimates?
  • Methods for Model Comparison
  • How to compare the relative performance among competing models?

Methods for Performance Evaluation

  • How to obtain a reliable estimate of performance?
  • Performance of a model may depend on other factors besides the learning algorithm:
  • Class distribution
  • Cost of misclassification
  • Size of training and test sets

Methods of Estimation

  • Holdout
  • Reserve 2/3 for training and 1/3 for testing
  • Random subsampling
  • Repeated holdout
  • Cross validation
  • Partition data into k disjoint subsets
  • k-fold: train on k-1 partitions, test on the remaining one
  • Leave-one-out: k=n
  • Stratified sampling
  • oversampling vs undersampling
  • Bootstrap
  • Sampling with replacement

A?

A1

A2

A3

A4

FN

FP

TN

TP

TN

TP

d

c

b

a

d

a

+

+

+

+

=

+

+

+

+

=

Accuracy