ITS 836 DATA SCIENCE AND BIG DATA ANALYSIS

Volk12
ITS836Week11ClassifierPerformanceComparison11paste.pptx

School of Computer & Information Sciences

ITS 836 Data Science and Big Data Analytics

ITS 836

1

ML Algorithms - Applications

ITS 836

2

Overview

Bias and Variance Trade-off with Model Complexity

Supervised Learning - Review Classifiers

Naïve Bayes

Logistic Regression

Decision Trees

Advanced Classifiers

Support Vector Machine

Random Forest

XG Boost

Artificial Neural Networks

ITS 836

3

Bias - Variance

Tradeoff

Measuring Quality of Fit

Suppose we have a regression problem.

One common measure of accuracy is the mean squared error (MSE) i.e.

Where is the prediction our method gives for the observation in our training data.

ITS 836

5

Bias/ Variance Tradeoff Two competing forces govern the choice of learning method

Bias error introduced by modeling:

a real life problem as much simpler problem

usually extremely complicated in real life

E.g. linear regression assumes a linear relationship between Y and X.

In real life, the relationship is not linear so some bias will be present.

more flexible/complex a method is the less bias it will generally have.

Variance how much your estimate for f

Would change by if you had a different training data set

The more flexible a method is the more variance it has

So with Model Complexity:

High -> less bias, high variance

Low -> high bias, low variance

ITS 836

6

The Trade-off with MSE – Mean Squared Error

Given, X=x0, the expected test MSE for a new Y at x0 is:

As a method gets more complex:

the bias will decrease and the variance will increase

but expected test MSE may go up or down!

ITS 836

7

The Classification Setting

For a regression problem, use the MSE to assess the accuracy of the statistical learning method

For a classification problem use the error rate i.e.

The indicator function:

Gives 1 if the condition is correct, otherwise it gives a 0.

the error rate represents the fraction of incorrect classifications, or misclassifications

ITS 836

8

Training vs. Test Error Rates on Simulated Data

Training error rates goes down

as k decreases or equivalently

as the flexibility increases.

Test error rate at first decreases

but then starts to increase again.

ITS 836

9

A Fundamental Picture

Training data errors:

always decline with complexity

Test data errors:

decline at first (as reductions in bias dominate)

then start to increase again (as increases in variance dominate).

More flexible/complicated is not always better!

ITS 836

10

Supervised Learning

Review

Classification

Supervised Learning

ITS 836

12

Build function that generalizes over data it has never seen

Build model with training data

Use test data, analyze performance

Regression vs Classification

Discrete data – “Classification”

Example: Disease vs Healthy

Continuous data – “Regression”

Example:

ITS 836

13

Classification

Naïve Bayes

Logistic Regression

Decision Trees

KNN

Support Vector Machine

Random Forest

XG Boost

Artificial Neural Networks

Supervised Learning Algorithms Classification

Supervised Learning Algorithms Classification

ITS 836

14

Classification

Naïve Bayes

Logistic Regression

Decision Trees

K-Nearest Neighbor KNN

Support Vector Machine

Random Forest

XG Boost

Artificial Neural Networks

P(Y=f(x))

Supervised Learning Algorithms Classification

ITS 836

15

Classification

Naïve Bayes

Logistic Regression

Decision Trees

K-Nearest Neighbor KNN

Support Vector Machine

Random Forest

XG Boost

Artificial Neural Networks

Supervised Learning Algorithms Classification

ITS 836

16

Classification

Naïve Bayes

Logistic Regression

Decision Trees

K-Nearest Neighbor KNN

Support Vector Machine

Random Forest

XG Boost

Artificial Neural Networks

Advanced

Classifiers

SVM

Random Forest

XG-Boost

ANN

Supervised Learning Algorithms Classification

ITS 836

18

Classification

Naïve Bayes

Logistic Regression

Decision Trees

K-Nearest Neighbor KNN

Support Vector Machine

Random Forest

XG Boost

Artificial Neural Networks

SVM outputs an optimal hyperplane between points

In two dimentional space:

this hyperplane is a line dividing a plane in two parts

where in each class lay in either side.

19

Support Vector Machine (SVM)

Support vectors

Maximizes

margin

SVMs maximize the margin around the separating hyperplane.

a.k.a. large margin classifiers

Decision function: fully specified by a:

subset of training samples, “support vectors”

SVMs is a quadratic programming problem

Most successful current text classification method*

*but other discriminative methods often perform very similarly

Sec. 15.1

Narrower

margin

ITS 836

Linear Support Vector Machines

x1

x2

=+1

=-1

Data: <xi,yi>, i=1,..,l

xi  Rd

yi  {-1,+1}

ITS 836

20

20

=-1

=+1

Data: <xi,yi>, i=1,..,l

xi  Rd

yi  {-1,+1}

All hyperplanes in Rd are parameterize by a vector (w) and a constant b.

Expressed as w•x+b=0

Equation for a hyperplane from algebra

Find such a hyperplane f(x)=sign(w•x+b), that

correctly classify the data.

f(x)

Linear SVM 2

ITS 836

21

21

d+

d-

Definitions for SVM

Define the hyperplane H such that:

xi•w+b  +1 when yi =+1

xi•w+b  -1 when yi =-1

d+ = the shortest distance to the closest positive point

d- = the shortest distance to the closest negative point

The margin of a separating hyperplane is d+ + d-.

H

H1 and H2 are the planes:

H1: xi•w+b = +1

H2: xi•w+b = -1

The points on the planes H1 and H2 are the Support Vectors

H1

H2

ITS 836

22

22

No Linear Separators - Kernels

mapped to higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

ITS 836

23

Example – Kernel Trick

ITS 836

24

z = x² + y².

Supervised Learning Algorithms Classification

ITS 836

25

Classification

Naïve Bayes

Logistic Regression

Decision Trees

K-Nearest Neighbor

Support Vector Machine

Random Forest

XG Boost

Artificial Neural Networks

Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction

Background: Decision Trees can Overfit

Perfectly fit to any training data

Zero bias, high variance

Two approaches to reduce variance:

Stop growing the tree when further splitting the data does not yield an improvement

Grow a full tree, then prune the tree, by eliminating nodes.

ITS 836

26

Bootstrap Aggregation = Bagging

Bootstrap Aggregation = bagging:

a powerful technique that reduces model variances (overfitting)

Take the original dataset and create M subsets each

with m samples per subset.

The m individual samples are uniformly sampled with replacement from the original dataset. 

ITS 836

27

Random Forest Classifier

N examples

Training Data

M features

ITS 836

28

As the classifier of choice we adopt Random Forest Classifier, due to its robustness to heterogenous and noisy feature.

Random Forest is an ensembe classifier. Briefly, given N data and M features, bootsrap samples are created from the traning data

Step 1. Create Bootstrap Samples Random Forest Classifier

N examples

Create bootstrap samples

from the training data

....…

M features

ITS 836

29

Step 2. Construct Decision Trees Random Forest Classifier

N examples

Construct a decision tree

....…

M features

ITS 836

30

From eachd descision tree .. In spliting the nodes, the Gini Gain is employed

Step 3. Choose m<M features Random Forest Classifier

N examples

....…

M features

At each node in choosing the split feature

choose only among m<M features

ITS 836

31

Different from the regular decision trees in random forest, when the splitting feature is chosen from only a subset of allf eatures.

The robostness of the classifier arises from

bootsraping of the training data and the random selection of features, the the random choose of features.

Step 4. Create Decision for each Bootstrap Sample Random Forest Classifier

Create decision tree

from each bootstrap sample

N examples

....…

....…

M features

ITS 836

32

Step 5. Majority Vote Random Forest Classifier

N examples

....…

....…

Take he majority vote

M features

ITS 836

33

Finally the given an input data the decision is made as follows given an input data,

Supervised Learning Algorithms Classification

ITS 836

34

Classification

Naïve Bayes

Logistic Regression

Decision Trees

K-Nearest Neighbor KNN

Support Vector Machine

Random Forest

XG Boost

Artificial Neural Networks

XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting”

https://xgboost.readthedocs.io/en/latest/tutorials/model.html

Bagging vs Boosting

N new training data sets are produced by random sampling with replacement from the original set

Bagging, each element has same probability to appear in a new data set.

Boosting: each element weighted - some of them will take part in the new sets more often

ITS 836

35

Bagging vs Boosting 2

Bagging: train in parallel

result is obtained by averaging the responses of the N (Majority vote)

Boosting: train sequentially taking into account the previous classifiers’ success –

the algorithm allocates weights to each resulting model.

a good classifier on the training data assigned a higher weight than a poor one

ITS 836

36

https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/

Random Forest Revisited

ITS 836

37

XGBoost

XGBoost:“Extreme Gradient Boosting”

ITS 836

38

https://xgboost.readthedocs.io/en/latest/tutorials/model.html

Supervised Learning Algorithms Classification

ITS 836

39

Classification

Naïve Bayes

Logistic Regression

Decision Trees

K-Nearest Neighbor KNN

Support Vector Machine

Random Forest

XG Boost

Artificial Neural Networks

Questions?

ITS 836

40

å

=

-

=

n

i

i

i

y

y

n

MSE

1

2

)

ˆ

(

1

i

y

ˆ

ExpectedTestMSE= E Y − f (x0 )( ) 2 = Bias2 +Var+ σ 2

Irreducible Error 

ExpectedTestMSE=EY-f(x

0

)

()

2

=Bias

2

+Var+s

2

Irreducible Error

n

y

y

I

Rate

Error

n

i

i

i

/

)

ˆ

(

1

å

=

¹

=

)

ˆ

(

i

i

y

y

I

¹

)

ˆ

(

i

i

y

y

¹

X

X

e

e

Y

P

p

1

0

1

0

1

)

1

(

b

b

b

b

+

+

+

=

=

=