ITS 836 DATA SCIENCE AND BIG DATA ANALYSIS
School of Computer & Information Sciences
ITS 836 Data Science and Big Data Analytics
ITS 836
1
ML Algorithms - Applications
ITS 836
2
Overview
Bias and Variance Trade-off with Model Complexity
Supervised Learning - Review Classifiers
Naïve Bayes
Logistic Regression
Decision Trees
Advanced Classifiers
Support Vector Machine
Random Forest
XG Boost
Artificial Neural Networks
ITS 836
3
Bias - Variance
Tradeoff
Measuring Quality of Fit
Suppose we have a regression problem.
One common measure of accuracy is the mean squared error (MSE) i.e.
Where is the prediction our method gives for the observation in our training data.
ITS 836
5
Bias/ Variance Tradeoff Two competing forces govern the choice of learning method
Bias error introduced by modeling:
a real life problem as much simpler problem
usually extremely complicated in real life
E.g. linear regression assumes a linear relationship between Y and X.
In real life, the relationship is not linear so some bias will be present.
more flexible/complex a method is the less bias it will generally have.
Variance how much your estimate for f
Would change by if you had a different training data set
The more flexible a method is the more variance it has
So with Model Complexity:
High -> less bias, high variance
Low -> high bias, low variance
ITS 836
6
The Trade-off with MSE – Mean Squared Error
Given, X=x0, the expected test MSE for a new Y at x0 is:
As a method gets more complex:
the bias will decrease and the variance will increase
but expected test MSE may go up or down!
ITS 836
7
The Classification Setting
For a regression problem, use the MSE to assess the accuracy of the statistical learning method
For a classification problem use the error rate i.e.
The indicator function:
Gives 1 if the condition is correct, otherwise it gives a 0.
the error rate represents the fraction of incorrect classifications, or misclassifications
ITS 836
8
Training vs. Test Error Rates on Simulated Data
Training error rates goes down
as k decreases or equivalently
as the flexibility increases.
Test error rate at first decreases
but then starts to increase again.
ITS 836
9
A Fundamental Picture
Training data errors:
always decline with complexity
Test data errors:
decline at first (as reductions in bias dominate)
then start to increase again (as increases in variance dominate).
More flexible/complicated is not always better!
ITS 836
10
Supervised Learning
Review
Classification
Supervised Learning
ITS 836
12
Build function that generalizes over data it has never seen
Build model with training data
Use test data, analyze performance
Regression vs Classification
Discrete data – “Classification”
Example: Disease vs Healthy
Continuous data – “Regression”
Example:
ITS 836
13
Classification
Naïve Bayes
Logistic Regression
Decision Trees
KNN
Support Vector Machine
Random Forest
XG Boost
Artificial Neural Networks
Supervised Learning Algorithms Classification
Supervised Learning Algorithms Classification
ITS 836
14
Classification
Naïve Bayes
Logistic Regression
Decision Trees
K-Nearest Neighbor KNN
Support Vector Machine
Random Forest
XG Boost
Artificial Neural Networks
P(Y=f(x))
Supervised Learning Algorithms Classification
ITS 836
15
Classification
Naïve Bayes
Logistic Regression
Decision Trees
K-Nearest Neighbor KNN
Support Vector Machine
Random Forest
XG Boost
Artificial Neural Networks
Supervised Learning Algorithms Classification
ITS 836
16
Classification
Naïve Bayes
Logistic Regression
Decision Trees
K-Nearest Neighbor KNN
Support Vector Machine
Random Forest
XG Boost
Artificial Neural Networks
Advanced
Classifiers
SVM
Random Forest
XG-Boost
ANN
Supervised Learning Algorithms Classification
ITS 836
18
Classification
Naïve Bayes
Logistic Regression
Decision Trees
K-Nearest Neighbor KNN
Support Vector Machine
Random Forest
XG Boost
Artificial Neural Networks
SVM outputs an optimal hyperplane between points
In two dimentional space:
this hyperplane is a line dividing a plane in two parts
where in each class lay in either side.
19
Support Vector Machine (SVM)
Support vectors
Maximizes
margin
SVMs maximize the margin around the separating hyperplane.
a.k.a. large margin classifiers
Decision function: fully specified by a:
subset of training samples, “support vectors”
SVMs is a quadratic programming problem
Most successful current text classification method*
*but other discriminative methods often perform very similarly
Sec. 15.1
Narrower
margin
ITS 836
Linear Support Vector Machines
x1
x2
=+1
=-1
Data: <xi,yi>, i=1,..,l
xi Rd
yi {-1,+1}
ITS 836
20
20
=-1
=+1
Data: <xi,yi>, i=1,..,l
xi Rd
yi {-1,+1}
All hyperplanes in Rd are parameterize by a vector (w) and a constant b.
Expressed as w•x+b=0
Equation for a hyperplane from algebra
Find such a hyperplane f(x)=sign(w•x+b), that
correctly classify the data.
f(x)
Linear SVM 2
ITS 836
21
21
d+
d-
Definitions for SVM
Define the hyperplane H such that:
xi•w+b +1 when yi =+1
xi•w+b -1 when yi =-1
d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d-.
H
H1 and H2 are the planes:
H1: xi•w+b = +1
H2: xi•w+b = -1
The points on the planes H1 and H2 are the Support Vectors
H1
H2
ITS 836
22
22
No Linear Separators - Kernels
mapped to higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
ITS 836
23
Example – Kernel Trick
ITS 836
24
z = x² + y².
Supervised Learning Algorithms Classification
ITS 836
25
Classification
Naïve Bayes
Logistic Regression
Decision Trees
K-Nearest Neighbor
Support Vector Machine
Random Forest
XG Boost
Artificial Neural Networks
Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction
Background: Decision Trees can Overfit
Perfectly fit to any training data
Zero bias, high variance
Two approaches to reduce variance:
Stop growing the tree when further splitting the data does not yield an improvement
Grow a full tree, then prune the tree, by eliminating nodes.
ITS 836
26
Bootstrap Aggregation = Bagging
Bootstrap Aggregation = bagging:
a powerful technique that reduces model variances (overfitting)
Take the original dataset and create M subsets each
with m samples per subset.
The m individual samples are uniformly sampled with replacement from the original dataset.
ITS 836
27
Random Forest Classifier
N examples
Training Data
M features
ITS 836
28
As the classifier of choice we adopt Random Forest Classifier, due to its robustness to heterogenous and noisy feature.
Random Forest is an ensembe classifier. Briefly, given N data and M features, bootsrap samples are created from the traning data
Step 1. Create Bootstrap Samples Random Forest Classifier
N examples
Create bootstrap samples
from the training data
....…
M features
ITS 836
29
Step 2. Construct Decision Trees Random Forest Classifier
N examples
Construct a decision tree
....…
M features
ITS 836
30
From eachd descision tree .. In spliting the nodes, the Gini Gain is employed
Step 3. Choose m<M features Random Forest Classifier
N examples
....…
M features
At each node in choosing the split feature
choose only among m<M features
ITS 836
31
Different from the regular decision trees in random forest, when the splitting feature is chosen from only a subset of allf eatures.
The robostness of the classifier arises from
bootsraping of the training data and the random selection of features, the the random choose of features.
Step 4. Create Decision for each Bootstrap Sample Random Forest Classifier
Create decision tree
from each bootstrap sample
N examples
....…
....…
M features
ITS 836
32
Step 5. Majority Vote Random Forest Classifier
N examples
....…
....…
Take he majority vote
M features
ITS 836
33
Finally the given an input data the decision is made as follows given an input data,
Supervised Learning Algorithms Classification
ITS 836
34
Classification
Naïve Bayes
Logistic Regression
Decision Trees
K-Nearest Neighbor KNN
Support Vector Machine
Random Forest
XG Boost
Artificial Neural Networks
XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting”
https://xgboost.readthedocs.io/en/latest/tutorials/model.html
Bagging vs Boosting
N new training data sets are produced by random sampling with replacement from the original set
Bagging, each element has same probability to appear in a new data set.
Boosting: each element weighted - some of them will take part in the new sets more often
ITS 836
35
Bagging vs Boosting 2
Bagging: train in parallel
result is obtained by averaging the responses of the N (Majority vote)
Boosting: train sequentially taking into account the previous classifiers’ success –
the algorithm allocates weights to each resulting model.
a good classifier on the training data assigned a higher weight than a poor one
ITS 836
36
https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/
Random Forest Revisited
ITS 836
37
XGBoost
XGBoost:“Extreme Gradient Boosting”
ITS 836
38
https://xgboost.readthedocs.io/en/latest/tutorials/model.html
Supervised Learning Algorithms Classification
ITS 836
39
Classification
Naïve Bayes
Logistic Regression
Decision Trees
K-Nearest Neighbor KNN
Support Vector Machine
Random Forest
XG Boost
Artificial Neural Networks
Questions?
ITS 836
40
å
=
-
=
n
i
i
i
y
y
n
MSE
1
2
)
ˆ
(
1
i
y
ˆ
ExpectedTestMSE= E Y − f (x0 )( ) 2 = Bias2 +Var+ σ 2
Irreducible Error
ExpectedTestMSE=EY-f(x
0
)
()
2
=Bias
2
+Var+s
2
Irreducible Error
n
y
y
I
Rate
Error
n
i
i
i
/
)
ˆ
(
1
å
=
¹
=
)
ˆ
(
i
i
y
y
I
¹
)
ˆ
(
i
i
y
y
¹
X
X
e
e
Y
P
p
1
0
1
0
1
)
1
(
b
b
b
b
+
+
+
=
=
=