Decision Trees

Shrav

PPT.ppt

Home >Computer Science homework help >Decision Trees

Tree Induction

Greedy strategy.
Split the records based on an attribute test that optimizes certain criterion.

Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting

Stopping Criteria for Tree Induction

Stop expanding a node when all the records belong to the same class

Stop expanding a node when all the records have similar attribute values

Early termination (to be discussed later)

Practical Issues of Classification

Underfitting and Overfitting

Missing Values

Costs of Classification

Underfitting and Overfitting

Overfitting

Underfitting: when model is too simple, both training and test errors are large

Overfitting due to Noise

Decision boundary is distorted by noise point

Overfitting due to Noise

* Bats and Whales are misclassified; non-mammals instead of mammals.

Overfitting due to Noise

Decision boundary is distorted by noise point

Both humans and dolphins were misclassified as n0n-mammals b/c Body Temp, Gives_Birth and Four-legged values are identical to mislabeled records in training set.

Spiny anteaters represent an exceptional case (every warm-blooded with no gives_birth is non-mammal in TR_Set

Decision tree perfectly fits training data (training error=0)

But error rate on test data is 30%.

Overfitting due to Noise

Estimating Generalization Errors

Re-substitution errors: error on training ( e(t) )
Generalization errors: error on testing ( e’(t))

Methods for estimating generalization errors:
Optimistic approach: e’(t) = e(t)
Pessimistic approach:
For each leaf node: e’(t) = (e(t)+0.5)
Total errors: e’(T) = e(T) + N  0.5 (N: number of leaf nodes)
For a tree with 30 leaf nodes and 10 errors on training
(out of 1000 instances):
Training error = 10/1000 = 1%

Generalization error = (10 + 300.5)/1000 = 2.5%

Reduced error pruning (REP):
uses validation data set to estimate generalization
error

Occam’s Razor

Given two models of similar generalization errors, one should prefer the simpler model over the more complex model

For complex models, there is a greater chance that it was fitted accidentally by errors in data

Therefore, one should include model complexity when evaluating a model

How to Address Overfitting

Pre-Pruning (Early Stopping Rule)
Stop the algorithm before it becomes a fully-grown tree
Typical stopping conditions for a node:
Stop if all instances belong to the same class
Stop if all the attribute values are the same
More restrictive conditions:
Stop if number of instances is less than some user-specified threshold
Stop if class distribution of instances are independent of the available features (e.g., using  2 test)
Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).

How to Address Overfitting…

Post-pruning
Grow decision tree to its entirety
Trim the nodes of the decision tree in a bottom-up fashion
If generalization error improves after trimming, replace sub-tree by a leaf node.
Class label of leaf node is determined from majority class of instances in the sub-tree

Example of Post-Pruning

Training Error (Before splitting) = 10/30

Pessimistic error = (10 + 0.5)/30 = 10.5/30

Training Error (After splitting) = 7/30

Pessimistic error (After splitting)

= (7 + 4  0.5)/30 = 9/30

PRUNE OR DO NOT PRUNE

Class = Yes	20
Class = No	10
Error = ?

Class = Yes	8
Class = No	4

Class = Yes	2
Class = No	5

Class = Yes	6
Class = No	1

Class = Yes	4
Class = No	0

Handling Missing Attribute Values

Missing values affect decision tree construction in three different ways:
Affects how impurity measures are computed
Affects how to distribute instance with missing value to child nodes
Affects how a test instance with missing value is classified

Model Evaluation

Metrics for Performance Evaluation
How to evaluate the performance of a model?

Methods for Performance Evaluation
How to obtain reliable estimates?

Methods for Model Comparison
How to compare the relative performance among competing models?

Model Evaluation

Metrics for Performance Evaluation
How to evaluate the performance of a model?

Methods for Performance Evaluation
How to obtain reliable estimates?

Methods for Model Comparison
How to compare the relative performance among competing models?

Metrics for Performance Evaluation

Focus on the predictive capability of a model
Rather than how fast it takes to classify or build models, scalability, etc.
Confusion Matrix:

a: TP (true positive)

b: FN (false negative)

c: FP (false positive)

d: TN (true negative)

PREDICTED CLASS
ACTUAL CLASS	Class=Yes	Class=No
Class=Yes	a	b
Class=No	c	d

Metrics for Performance Evaluation…

Most widely-used metric:

PREDICTED CLASS
ACTUAL CLASS	Class=Yes	Class=No
Class=Yes	a (TP)	b (FN)
Class=No	c (FP)	d (TN)

Limitation of Accuracy

Consider a 2-class problem
Number of Class 0 examples = 9990
Number of Class 1 examples = 10

If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %
Accuracy is misleading because model does not detect any class 1 example

Cost Matrix

C(i|j): Cost of misclassifying class j example as class i

PREDICTED CLASS
ACTUAL CLASS	C(i\|j)	Class=Yes	Class=No
Class=Yes	C(Yes\|Yes)	C(No\|Yes)
Class=No	C(Yes\|No)	C(No\|No)

Computing Cost of Classification

Accuracy = 80%

Cost = 3910

Accuracy = 90%

Cost = 4255

Cost Matrix	PREDICTED CLASS
ACTUAL CLASS	C(i\|j)	+	-
+	-1	100
-	1	0

Model M1	PREDICTED CLASS
ACTUAL CLASS	+	-
+	150	40
-	60	250

Model M2	PREDICTED CLASS
ACTUAL CLASS	+	-
+	250	45
-	5	200

Cost vs Accuracy

Count	PREDICTED CLASS
ACTUAL CLASS	Class=Yes	Class=No
Class=Yes	a	b
Class=No	c	d

Cost	PREDICTED CLASS
ACTUAL CLASS	Class=Yes	Class=No
Class=Yes	p	q
Class=No	q	p

N = a + b + c + d

Accuracy = (a + d)/N

Cost = p (a + d) + q (b + c)

= p (a + d) + q (N – a – d)

= q N – (q – p)(a + d)

= N [q – (q-p)  Accuracy]

Accuracy is proportional to cost if
1. C(Yes|No)=C(No|Yes) = q
2. C(Yes|Yes)=C(No|No) = p

Model Evaluation

Metrics for Performance Evaluation
How to evaluate the performance of a model?

Methods for Performance Evaluation
How to obtain reliable estimates?

Methods for Model Comparison
How to compare the relative performance among competing models?

Methods for Performance Evaluation

How to obtain a reliable estimate of performance?

Performance of a model may depend on other factors besides the learning algorithm:
Class distribution
Cost of misclassification
Size of training and test sets

Methods of Estimation

Holdout
Reserve 2/3 for training and 1/3 for testing
Random subsampling
Repeated holdout
Cross validation
Partition data into k disjoint subsets
k-fold: train on k-1 partitions, test on the remaining one
Leave-one-out: k=n
Stratified sampling
oversampling vs undersampling
Bootstrap
Sampling with replacement

Accuracy