DAta mining week 7 assignment

Abhi

ITS_632_Week5_ClassificationII.pdf

Home >Computer Science homework help >DAta mining week 7 assignment

Dr. Oner Celepcikay

ITS 632

Machine Learning

Week 5

Classification Part II

Tree Induction

● Greedy strategy. – Split the records based on an attribute test

that optimizes certain criterion.

● Issues – Determine how to split the records

uHow to specify the attribute test condition? uHow to determine the best split?

– Determine when to stop splitting

Stopping Criteria for Tree Induction

● Stop expanding a node when all the records belong to the same class

● Stop expanding a node when all the records have similar attribute values

● Early termination (to be discussed later)

Practical Issues of Classification

● Underfitting and Overfitting

● Missing Values

● Costs of Classification

Underfitting and Overfitting

Overfitting

Underfitting: when model is too simple, both training and test errors are large

Overfitting due to Noise

Decision boundary is distorted by noise point

Overfitting due to Noise

* Bats and Whales are misclassified; non-mammals instead of mammals.

Overfitting due to Noise

Decision boundary is distorted by noise point

• Both humans and dolphins were misclassified as n0n- mammals b/c Body Temp, Gives_Birth and Four-legged values are identical to mislabeled records in training set.

• Spiny anteaters represent an exceptional case (every warm- blooded with no gives_birth is non-mammal in TR_Set

• Decision tree perfectly fits training data (training error=0)

• But error rate on test data is 30%.

Overfitting due to Noise

Estimating Generalization Errors

● Re-substitution errors: error on training (S e(t) ) ● Generalization errors: error on testing (S e�(t)) ● Methods for estimating generalization errors:

– Optimistic approach: e�(t) = e(t) – Pessimistic approach:

u For each leaf node: e�(t) = (e(t)+0.5) u Total errors: e�(T) = e(T) + N ´ 0.5 (N: number of leaf nodes) u For a tree with 30 leaf nodes and 10 errors on training

(out of 1000 instances): Training error = 10/1000 = 1% Generalization error = (10 + 30´0.5)/1000 = 2.5%

– Reduced error pruning (REP): u uses validation data set to estimate generalization

error

Occam�s Razor

● Given two models of similar generalization errors, one should prefer the simpler model over the more complex model

● For complex models, there is a greater chance that it was fitted accidentally by errors in data

● Therefore, one should include model complexity when evaluating a model

How to Address Overfitting

● Pre-Pruning (Early Stopping Rule) – Stop the algorithm before it becomes a fully-grown tree – Typical stopping conditions for a node:

u Stop if all instances belong to the same class u Stop if all the attribute values are the same

– More restrictive conditions: u Stop if number of instances is less than some user-specified threshold u Stop if class distribution of instances are independent of the available features (e.g., using c 2 test) u Stop if expanding the current node does not improve impurity

measures (e.g., Gini or information gain).

How to Address Overfitting…

● Post-pruning – Grow decision tree to its entirety – Trim the nodes of the decision tree in a

bottom-up fashion – If generalization error improves after trimming,

replace sub-tree by a leaf node. – Class label of leaf node is determined from

majority class of instances in the sub-tree

Example of Post-Pruning

A2 A3

Class = Yes 20

Class = No 10 Error = ?

Training Error (Before splitting) = 10/30 Pessimistic error = (10 + 0.5)/30 = 10.5/30 Training Error (After splitting) = 7/30

Pessimistic error (After splitting) = (7 + 4 ´ 0.5)/30 = 9/30

PRUNE OR DO NOT PRUNE

Class = Yes 8 Class = No 4

Class = Yes 2 Class = No 5

Class = Yes 6 Class = No 1

Class = Yes 4 Class = No 0

Handling Missing Attribute Values

● Missing values affect decision tree construction in three different ways: – Affects how impurity measures are computed – Affects how to distribute instance with missing

value to child nodes – Affects how a test instance with missing value

is classified

Model Evaluation

● Metrics for Performance Evaluation – How to evaluate the performance of a model?

● Methods for Performance Evaluation – How to obtain reliable estimates?

● Methods for Model Comparison – How to compare the relative performance

among competing models?

Model Evaluation

● Metrics for Performance Evaluation – How to evaluate the performance of a model?

● Methods for Performance Evaluation – How to obtain reliable estimates?

● Methods for Model Comparison – How to compare the relative performance

among competing models?

Metrics for Performance Evaluation

● Focus on the predictive capability of a model – Rather than how fast it takes to classify or

build models, scalability, etc. ● Confusion Matrix:

PREDICTED CLASS

ACTUAL CLASS

Class=Yes Class=No

Class=Yes a b

Class=No c d

a: TP (true positive) b: FN (false negative) c: FP (false positive) d: TN (true negative)

Metrics for Performance Evaluation…

● Most widely-used metric:

PREDICTED CLASS

ACTUAL CLASS

Class=Yes Class=No

Class=Yes a (TP)

b (FN)

Class=No c (FP)

d (TN)

FNFPTNTP TNTP

dcba da

+++ +

= +++

+ =Accuracy

Limitation of Accuracy

● Consider a 2-class problem – Number of Class 0 examples = 9990 – Number of Class 1 examples = 10

● If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % – Accuracy is misleading because model does

not detect any class 1 example

Cost Matrix

PREDICTED CLASS

ACTUAL CLASS

C(i|j) Class=Yes Class=No

Class=Yes C(Yes|Yes) C(No|Yes)

Class=No C(Yes|No) C(No|No)

C(i|j): Cost of misclassifying class j example as class i

Computing Cost of Classification

Cost Matrix

PREDICTED CLASS

ACTUAL CLASS

C(i|j) + - + -1 100 - 1 0

Model M1 PREDICTED CLASS

ACTUAL CLASS

+ - + 150 40 - 60 250

Model M2 PREDICTED CLASS

ACTUAL CLASS

+ - + 250 45 - 5 200

Accuracy = 80% Cost = 3910

Accuracy = 90% Cost = 4255

Cost vs Accuracy

Count PREDICTED CLASS

ACTUAL CLASS

Class=Yes Class=No

Class=Yes a b

Class=No c d

Cost PREDICTED CLASS

ACTUAL CLASS

Class=Yes Class=No

Class=Yes p q

Class=No q p

N = a + b + c + d

Accuracy = (a + d)/N

Cost = p (a + d) + q (b + c)

= p (a + d) + q (N – a – d)

= q N – (q – p)(a + d)

= N [q – (q-p) ´ Accuracy]

Accuracy is proportional to cost if 1. C(Yes|No)=C(No|Yes) = q 2. C(Yes|Yes)=C(No|No) = p

Model Evaluation

● Metrics for Performance Evaluation – How to evaluate the performance of a model?

● Methods for Performance Evaluation – How to obtain reliable estimates?

● Methods for Model Comparison – How to compare the relative performance

among competing models?

Methods for Performance Evaluation

● How to obtain a reliable estimate of performance?

● Performance of a model may depend on other factors besides the learning algorithm: – Class distribution – Cost of misclassification – Size of training and test sets

Methods of Estimation ● Holdout

– Reserve 2/3 for training and 1/3 for testing ● Random subsampling

– Repeated holdout ● Cross validation

– Partition data into k disjoint subsets – k-fold: train on k-1 partitions, test on the remaining one – Leave-one-out: k=n

● Stratified sampling – oversampling vs undersampling

● Bootstrap – Sampling with replacement