INDIVIDUAL ASSIGNMENT
Lecture 6 Bagging and Boosting
1 / 45
Recap
In the last lecture, we discussed decision trees:
binary trees
how to perform prediction
how to grow a tree, with different criteria for regression and classification
Example: List of AI misuses https://github.com/daviddao/awful-ai
2 / 45
Outline
1 Bagging
2 Random Forest
3 Boosting
3 / 45
Bagging Key ideas
Previous lectures: we want to build a flexible model but this can lead to overfitting
the bias-variance trade-off: a simple model has high bias but low variance; a flexible model has low bias but high variance
Bagging: reduce variance without any notable increase in bias
build an ensemble of models (called base models)
predictions = average of predictions from the base models
analogy: wisdom of the crowd
we want the models to use the whole dataset, but predictions should be different.
we train the base models using slightly different versions of the dataset
the different versions are generated using bootstrapping
bagging is short for bootstrap aggregating
4 / 45
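The key ideas above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the base model (a 1-nearest-neighbour regressor), the toy data, and the names `fit_1nn` and `fit_bagging` are all assumptions made for the example.

```python
import random
from statistics import mean

def fit_1nn(train):
    """Return a 1-nearest-neighbour regressor (a flexible, high-variance base model)."""
    def predict(x):
        nearest = min(train, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict

def fit_bagging(train, n_models, rng):
    """Fit n_models base models, each on its own bootstrap sample of train."""
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(train) points with replacement.
        boot = [rng.choice(train) for _ in range(len(train))]
        models.append(fit_1nn(boot))
    # The ensemble prediction is the average of the base-model predictions.
    return lambda x: mean(model(x) for model in models)

rng = random.Random(1)  # fixed seed so the sketch is reproducible
# Hypothetical noisy data: y = 2x + Gaussian noise.
train = [(i / 20, 2 * (i / 20) + rng.gauss(0, 0.3)) for i in range(21)]
single = fit_1nn(train)
bagged = fit_bagging(train, n_models=50, rng=rng)
print("single model:", single(0.52))
print("bagged ensemble:", bagged(0.52))
```

The single model returns the noisy target of one training point, while the bagged prediction averages over nearby points that happen to appear in the different bootstrap samples.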
Bagging Why does it work?
Suppose there are three base classifiers, each with misclassification rate, say, $\epsilon = 25\%$.
Assume the predictions made by different classifiers are independent.
The ensemble classifier (majority vote) makes a wrong prediction when two or more of the three classifiers are wrong. With $X$ the number of wrong classifiers, the probability of this is
$$P(X \ge 2) = \sum_{i=2}^{3} \binom{3}{i} \epsilon^{i} (1 - \epsilon)^{3-i} = 3(0.25)^2(0.75) + (0.25)^3 \approx 0.156,$$
which is smaller than the individual misclassification rate of 0.25.
Under the independence assumption, adding more classifiers usually lowers the error further.
Note: in practice, the predictions are correlated (one model or similar models trained on pretty much the same dataset).
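The calculation on this slide can be checked numerically. The sketch below assumes an odd number of classifiers, a common error rate, and independent errors (the idealised setting of the slide); the function name `majority_vote_error` is chosen for this example.

```python
from math import comb

def majority_vote_error(n_classifiers: int, eps: float) -> float:
    """Probability that a majority of independent classifiers is wrong.

    Assumes an odd number of classifiers, each wrong with probability eps.
    """
    k_min = n_classifiers // 2 + 1  # smallest number of wrong votes forming a majority
    return sum(
        comb(n_classifiers, i) * eps**i * (1 - eps) ** (n_classifiers - i)
        for i in range(k_min, n_classifiers + 1)
    )

# Error of the ensemble for a growing number of independent classifiers.
for n in (1, 3, 5, 11, 21):
    print(n, round(majority_vote_error(n, 0.25), 4))
```

With correlated classifiers, as in practice, the improvement is smaller than these numbers suggest.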
5 / 45
Bagging Bootstrap
Bootstrapping generates versions of a dataset by sampling with replacement. A data point can appear more than once in the bootstrapped dataset.
For a dataset of size $n$, for each $i = 1, \dots, n$:
Sample $l_i$ uniformly from the set of integers $\{1, \dots, n\}$
Pick the point $(x_{l_i}, y_{l_i})$ and add it to the new dataset
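A minimal sketch of the bootstrap procedure, using only the standard library. The data points and the name `bootstrap_sample` are hypothetical, and indices are 0-based here rather than 1-based.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data uniformly at random, with replacement."""
    n = len(data)
    indices = [rng.randrange(n) for _ in range(n)]  # l_1, ..., l_n (0-based)
    return [data[l] for l in indices]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = [(0.1, 1.0), (0.4, 1.3), (0.5, 0.9), (0.8, 2.1), (0.9, 2.4)]  # hypothetical (x, y) pairs
sample = bootstrap_sample(data, rng)
print(sample)
print("distinct points kept:", len(set(sample)), "of", len(data))
```

Because sampling is with replacement, some points typically appear several times and others not at all.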
6 / 45
Bagging Bootstrap example: data
7 / 45
Bagging Bootstrap example: one bootstrap
8 / 45
Bagging Example: data
9 / 45
Bagging Example: ensemble members
10 / 45
Bagging Example: bagging vs training once on all data
11 / 45
Bagging Example: bagging performance
12 / 45
Bagging Wrapping notes and Q&A
Advantages:
almost always improves performance
easy to implement
compatible with any base model
works well in practice (many Kaggle competitions were won by ensemble methods)
Disadvantages:
need to train multiple models (could be expensive)
less interpretable compared to a single model
Questions?
13 / 45
Random forest Key ideas
We discussed bagging to construct models on different variants of the same dataset.
Random forest is bagging with decision trees as base models, plus random selection of a subset of features/inputs at each split.
This might seem counterintuitive, as each base model will be slightly poorer. However, predictions from different trees will be less correlated, so averaging them reduces the variance of the final ensemble more.
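The effect of correlation can be made concrete with a standard result that is not on the slide: if B predictions each have variance sigma^2 and pairwise correlation rho, their average has variance rho * sigma^2 + (1 - rho) * sigma^2 / B. The sketch below simply evaluates this formula; the function name is chosen for this example.

```python
def ensemble_variance(sigma2: float, rho: float, n_models: int) -> float:
    """Variance of the average of n_models predictions, each with variance
    sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_models

# The first term does not shrink with more models, so lowering the
# correlation between trees is what drives the variance down.
for rho in (0.9, 0.5, 0.1):
    print(rho, ensemble_variance(1.0, rho, n_models=100))
```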
15 / 45
Random forest Key steps
For tree $b = 1$ to $B$:
1 Choose a bootstrap sample of size $N$ from the training set
2 Grow a tree $T_b$ on the bootstrapped data by recursively repeating the following steps for each leaf node of the tree, until a stopping criterion is met:
  1 Select $p$ variables at random from the $d$ variables ($p \le d$)
  2 Pick the best variable/split-point among the $p$ variables
  3 Split the node into two
Output the ensemble of trees $\{T_b\}_{b=1}^{B}$.
Prediction:
For regression:
$$F(x, \beta) = \frac{1}{B} \sum_{b=1}^{B} T_b(x).$$
For classification:
$$F(x, \beta) = \text{Majority Vote}\,\{T_b(x)\}_{b=1}^{B}$$
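Two pieces of the algorithm above are easy to sketch in isolation: the random choice of p candidate variables at a split, and the aggregation of the B tree outputs. Tree growing itself is omitted; the function names and the example values are assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import mean

def candidate_features(d, p, rng):
    """Pick p of the d feature indices at random, without replacement, for one split."""
    return rng.sample(range(d), p)

def forest_predict_regression(tree_predictions):
    """Average the B tree outputs T_b(x)."""
    return mean(tree_predictions)

def forest_predict_classification(tree_predictions):
    """Return the most common class among the B tree outputs T_b(x)."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
print(candidate_features(d=10, p=3, rng=rng))          # a fresh subset is drawn at every split
print(forest_predict_regression([2.0, 2.5, 3.0]))      # hypothetical outputs of three trees
print(forest_predict_classification(["spam", "ham", "spam"]))
```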
16 / 45
Random forest Example: data
17 / 45
Random forest Example: bagging ensemble
18 / 45
Random forest Example: random forest ensemble
19 / 45
Random forest Example: random forest vs bagging
20 / 45
Random forest Wrapping notes and Q&A
Advantages:
similar to bagging, but usually better because the trees are less correlated
Disadvantages:
similar to bagging
works less well on smaller training sets
Questions?
21 / 45
Boosting Key ideas
Bagging:
train multiple base models independently / in parallel
each model uses bootstrapped data
reduces variance without increasing bias by much
Boosting:
train multiple base models sequentially, one after another
each model is trained on reweighted data
data points are reweighted to upweight “misclassified” points and downweight “correctly classified” points
reduces bias
23 / 45
Concept of Boosting
Bagging aggregates individual weaker models by averaging or voting, and gives equal importance to each model.
Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification.
It works in a similar way to bagging, except that the models are grown sequentially, each using information from the previously grown models.
Boosting does not involve bootstrap datasets; instead, each model is fit on a modified (reweighted) version of the original dataset.
24 / 45
Boosting Algorithm
In the context of classification, the boosting algorithm can be described as follows:
1 Train a number of weak classifiers. Each classifier can be very “weak”, e.g. a decision stump (a tree of depth 1)
2 Each new classifier should focus more on the data points that were incorrectly classified in the previous round: wrongly classified data points get higher weights
3 Combine the classifiers by letting them vote on the final prediction
  1 The classifiers are weighted to combine them into a single powerful classifier
  2 Classifiers with low training misclassification rates get high weights
The final classifier is a weighted combination of the individual weak classifiers. Boosting is not limited to decision trees and can be used with many types of classifier.
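Step 3 can be sketched as a weighted vote over +1/-1 predictions. The voting powers and predictions below are made-up numbers, and mapping a tied score to -1 is an assumption of this sketch.

```python
def weighted_vote(predictions, alphas):
    """Combine +1/-1 predictions of weak classifiers using their voting powers."""
    score = sum(alpha * pred for alpha, pred in zip(alphas, predictions))
    return 1 if score > 0 else -1  # assumption: a zero score is mapped to -1

# One strong classifier can outvote two weaker ones ...
print(weighted_vote([1, -1, -1], alphas=[2.0, 0.5, 0.5]))
# ... whereas with equal voting powers the majority wins.
print(weighted_vote([1, -1, -1], alphas=[0.5, 0.5, 0.5]))
```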
25 / 45
Boosting Adaboost — training
26 / 45
Example
Consider the training data $\mathcal{D} = \{A(x_{11}, y_1), B(x_{21}, y_2), C(x_{31}, y_3), D(x_{41}, y_4), E(x_{51}, y_5)\}$, where the positive cases are A, B, D and E, while the negative case is C.
There are in total N = 5 cases and 1 feature (d = 1). See Figure 1.
Figure 1: Demo Dataset
27 / 45
Example - Algorithm
Suppose we have already produced 6 weak classifiers, each a decision stump with a single decision question.
Each weak classifier classifies a case by the rule “if the condition of the decision stump is satisfied, then predict 1 (positive), otherwise predict -1 (negative)”.
For example, consider the first classifier $x_1 < 1$ and the case A. The input $x_{11}$ of case A does not satisfy the condition $x_1 < 1$, hence the classifier classifies it as negative, which is wrong. See Figure 1 on the previous slide.
28 / 45
Algorithm Loop 1
1.1 Initialise the weights of the training examples. With no prior information, use equal weights. In this example
$$w_i = \frac{1}{N} = \frac{1}{5}, \quad i = 1, 2, 3, 4, 5$$
1.2 Calculate the misclassification rate $\epsilon_k$ for each classifier $\hat{F}_k(x, \beta^{(k)})$. There are 6 classifiers in this example, so $k = 1, 2, 3, 4, 5, 6$.
1.3 Pick the $\hat{F}_k(x, \beta^{(k)})$ with the lowest misclassification rate, i.e., $\hat{F}_4(x, \beta^{(4)})$; see Figure 2.
Figure 2
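The weighted misclassification rate in step 1.2 is the total weight of the misclassified cases. The sketch below uses the labels of cases A to E from the example, but the classifier shown (one that always predicts +1) is hypothetical and is not claimed to be any of the six classifiers in the figure.

```python
from fractions import Fraction

def weighted_error(predictions, labels, weights):
    """Sum of the weights of the cases whose prediction differs from the label."""
    return sum(w for p, y, w in zip(predictions, labels, weights) if p != y)

labels = [1, 1, -1, 1, 1]            # cases A, B, C, D, E (C is the negative case)
weights = [Fraction(1, 5)] * 5       # equal initial weights from step 1.1
all_positive = [1, 1, 1, 1, 1]       # HYPOTHETICAL classifier: always predicts +1
print(weighted_error(all_positive, labels, weights))  # only C is wrong
```

Using `Fraction` keeps the arithmetic exact, which makes ties between classifiers easy to detect.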
29 / 45
Algorithm Loop 1
1.4 Calculate the voting power of the best classifier $\hat{F}_4(x, \beta^{(4)})$. The lower the misclassification rate, the higher the voting power. The algorithm uses the natural log to define the voting power; here $\epsilon_4 = 1/5$, so
$$\alpha_k = \frac{1}{2} \log\left(\frac{1 - \epsilon_k}{\epsilon_k}\right) = \frac{1}{2} \log(4)$$
and the current best classifier is
$$\frac{1}{2} \log(4)\, \hat{F}_4(x, \beta^{(4)}).$$
1.5 Check whether the stopping criteria are met. Stop if one of the following conditions is met:
The combined classifier $F(x, \beta)$ is good enough (at this loop, this is $\hat{F}_4(x, \beta^{(4)})$)
Enough iterations have been run (we don't wish to loop too long)
No good classifier is left, e.g., the best remaining classifier has misclassification rate 0.5
Continue to the next step if no condition is met.
30 / 45
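A small sketch of the voting-power formula from step 1.4. The function name is chosen for this example; it rejects error rates of exactly 0 or 1, where the formula is undefined.

```python
from math import log

def voting_power(eps: float) -> float:
    """alpha = 0.5 * log((1 - eps) / eps), using the natural log."""
    if not 0 < eps < 1:
        raise ValueError("eps must lie strictly between 0 and 1")
    return 0.5 * log((1 - eps) / eps)

# Lower error gives higher voting power; an error of 0.5 gives no vote at all.
for eps in (0.2, 0.25, 0.5):
    print(eps, voting_power(eps))
```

This also shows why a best remaining classifier with misclassification rate 0.5 is a stopping condition: its voting power is zero.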
Algorithm Loop 1
1.6 Update the weights of the examples using $\epsilon$, the misclassification rate of the current best classifier, so that misclassified examples gain weight:
$$w_i^{\text{new}} = \begin{cases} \dfrac{1}{2(1 - \epsilon)}\, w_i^{\text{old}} & \text{if case } i \text{ is classified correctly} \\ \dfrac{1}{2\epsilon}\, w_i^{\text{old}} & \text{if case } i \text{ is classified incorrectly} \end{cases} \qquad (1)$$
For example, case C is incorrectly classified by the current best classifier $\hat{F}_4(x, \beta^{(4)})$, so the new weight for case C is
$$w_3^{\text{new}} = \frac{1}{2\epsilon}\, w_3^{\text{old}} = \frac{1}{2 \times (1/5)} \cdot \frac{1}{5} = \frac{1}{2}.$$
The new weights for all five cases are shown in column 3 of Figure 3.
Figure 3
Note: the new weights satisfy
$$\sum_{\text{correct}} w_i = \sum_{\text{incorrect}} w_i = \frac{1}{2}.$$
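The update rule (1) and the note about the two sums can be checked with exact fractions. This sketch uses the Loop 1 setting, where case C is the only misclassified case; the function name `update_weights` is chosen for this example.

```python
from fractions import Fraction

def update_weights(weights, correct, eps):
    """Rule (1): scale correct cases by 1/(2(1-eps)) and incorrect ones by 1/(2 eps)."""
    return [
        w / (2 * (1 - eps)) if ok else w / (2 * eps)
        for w, ok in zip(weights, correct)
    ]

weights = [Fraction(1, 5)] * 5                 # cases A, B, C, D, E
correct = [True, True, False, True, True]      # Loop 1: only case C is misclassified
eps = sum(w for w, ok in zip(weights, correct) if not ok)
new_weights = update_weights(weights, correct, eps)
print(new_weights)
print(sum(w for w, ok in zip(new_weights, correct) if ok),
      sum(w for w, ok in zip(new_weights, correct) if not ok))
```

The correctly and incorrectly classified cases each end up with total weight 1/2, so the classifier just selected has weighted error exactly 1/2 in the next loop.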
31 / 45
Algorithm Loop 2
The classifier trained in Loop 1 is not good enough, so we go to Loop 2.
2.1 Use the new weights from Loop 1.
2.2 Again, calculate the (weighted) misclassification rate $\epsilon_k$ for each classifier $\hat{F}_k(x, \beta^{(k)})$. The classifier selected in the last loop now has $\epsilon_k = 1/2$.
2.3 Pick the $\hat{F}_k(x, \beta^{(k)})$ with the lowest misclassification rate. If there is a tie, pick the first one. In this case, it is $\hat{F}_2(x, \beta^{(2)})$.
Figure 4
32 / 45
Algorithm Loop 2
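2.4 Calculate the voting power of the best classifier $\hat{F}_2(x, \beta^{(2)})$:
$$\alpha_k = \frac{1}{2} \log\left(\frac{1 - \epsilon_k}{\epsilon_k}\right) = \frac{1}{2} \log(3)$$
and construct the current best classifier
$$\frac{1}{2} \log(4)\, \hat{F}_4(x, \beta^{(4)}) + \frac{1}{2} \log(3)\, \hat{F}_2(x, \beta^{(2)})$$
The result is shown in Figure 5.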
Figure 5
33 / 45
Algorithm Loop 2
2.5 Check whether the stopping criteria are met. If yes, stop; otherwise go to step 2.6.
2.6 Update the weights of the examples using (1) again. For example, this time case C is classified correctly (by classifier 2), so its new weight is
$$w_3^{\text{new}} = \frac{1}{2(1 - \epsilon)}\, w_3^{\text{old}} = \frac{1}{2(1 - 2/8)} \cdot \frac{4}{8} = \frac{4}{12}$$
The updated weights are shown in Figure 6.
Figure 6
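The Loop 2 update can be reproduced with exact fractions. The starting weights follow from applying rule (1) in Loop 1. Which two positive cases classifier 2 misclassifies is not stated in the text, so the choice of D and E below is an assumption; the text only fixes that case C is correct and that the weighted error is 2/8.

```python
from fractions import Fraction

def update_weights(weights, correct, eps):
    """Rule (1): scale correct cases by 1/(2(1-eps)) and incorrect ones by 1/(2 eps)."""
    return [
        w / (2 * (1 - eps)) if ok else w / (2 * eps)
        for w, ok in zip(weights, correct)
    ]

# Weights after Loop 1 for cases A, B, C, D, E (case C carries weight 4/8).
weights = [Fraction(1, 8), Fraction(1, 8), Fraction(4, 8), Fraction(1, 8), Fraction(1, 8)]
# ASSUMPTION: classifier 2 misclassifies D and E.
correct = [True, True, True, False, False]
eps = sum(w for w, ok in zip(weights, correct) if not ok)
new_weights = update_weights(weights, correct, eps)
print("weighted error:", eps)
print("new weights:", new_weights)
```

`Fraction` reduces 4/12 to 1/3 when printing; the two are the same value.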
34 / 45
Algorithm Loop 3
The classifier trained in Loop 2 is not good enough, so we go to Loop 3.
3.1 Use the new weights from Loop 2.
3.2 Again, calculate the misclassification rate $\epsilon_k$ for each classifier $\hat{F}_k(x, \beta^{(k)})$.
3.3 Pick the $\hat{F}_k(x, \beta^{(k)})$ with the lowest misclassification rate. In this case, it is $\hat{F}_6(x, \beta^{(6)})$. See Figure 7.
Figure 7
35 / 45
Algorithm Loop 3
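3.4 Calculate the voting power of the best classifier $\hat{F}_6(x, \beta^{(6)})$:
$$\alpha_k = \frac{1}{2} \log\left(\frac{1 - \epsilon_k}{\epsilon_k}\right) = \frac{1}{2} \log(5)$$
and construct the current best classifier
$$\frac{1}{2} \log(4)\, \hat{F}_4(x, \beta^{(4)}) + \frac{1}{2} \log(3)\, \hat{F}_2(x, \beta^{(2)}) + \frac{1}{2} \log(5)\, \hat{F}_6(x, \beta^{(6)})$$
The result is shown in Figure 8.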
Figure 8
36 / 45
Algorithm Loop 4
We could continue with Loop 4, but the current best classifier
$$F(x, \beta) = \operatorname{sgn}\left( \frac{1}{2} \log(4)\, \hat{F}_4(x, \beta^{(4)}) + \frac{1}{2} \log(3)\, \hat{F}_2(x, \beta^{(2)}) + \frac{1}{2} \log(5)\, \hat{F}_6(x, \beta^{(6)}) \right)$$
already classifies all training examples correctly, so we can stop training.
Finally, as an exercise, use the above classifier to classify the new case F shown in Figure 9.
Figure 9
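The whole procedure (weighted errors, voting power, stopping check, weight update) can be sketched end to end. The figures are not reproduced here, so the 1-D data, the pool of six stumps, and the new case at x = 3.2 are all hypothetical; they were chosen so that the labels match cases A to E, but they are not claimed to be the slide's dataset or classifiers.

```python
from fractions import Fraction
from math import log

# HYPOTHETICAL data and stump pool: the slide's figures are not reproduced here.
xs = [1, 2, 3, 4, 5]        # one feature per case (A, B, C, D, E)
ys = [1, 1, -1, 1, 1]       # C is the only negative case
stumps = {                  # each stump predicts +1 if its condition holds, else -1
    "x < 6": lambda x: x < 6,
    "x < 2.5": lambda x: x < 2.5,
    "x > 3.5": lambda x: x > 3.5,
    "x > 2.5": lambda x: x > 2.5,
    "x < 3.5": lambda x: x < 3.5,
    "x < 0.5": lambda x: x < 0.5,
}

def stump_predict(name, x):
    return 1 if stumps[name](x) else -1

def ensemble_predict(ensemble, x):
    """Sign of the alpha-weighted sum of stump votes (a zero score maps to -1)."""
    score = sum(alpha * stump_predict(name, x) for alpha, name in ensemble)
    return 1 if score > 0 else -1

def adaboost(xs, ys, max_rounds=10):
    weights = [Fraction(1, len(xs))] * len(xs)   # equal initial weights
    ensemble = []                                # list of (alpha, stump name)
    for _ in range(max_rounds):
        # Weighted misclassification rate of every stump in the pool.
        errors = {
            name: sum(w for x, y, w in zip(xs, ys, weights) if stump_predict(name, x) != y)
            for name in stumps
        }
        best = min(errors, key=errors.get)       # on a tie, the first stump wins
        eps = errors[best]
        if eps == 0:                             # a perfect stump needs no partners
            return [(1.0, best)]
        if eps >= Fraction(1, 2):                # no useful classifier left
            break
        ensemble.append((0.5 * log(float((1 - eps) / eps)), best))
        if all(ensemble_predict(ensemble, x) == y for x, y in zip(xs, ys)):
            break                                # training data fully classified
        # Rule (1): shrink weights of correct cases, grow weights of incorrect ones.
        weights = [
            w / (2 * (1 - eps)) if stump_predict(best, x) == y else w / (2 * eps)
            for x, y, w in zip(xs, ys, weights)
        ]
    return ensemble

ensemble = adaboost(xs, ys)
for alpha, name in ensemble:
    print(f"{alpha:.3f}  {name}")
print("new case at x = 3.2 ->", ensemble_predict(ensemble, 3.2))
```

On this toy data the loop stops after three rounds, once the weighted vote classifies all five training cases correctly.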
37 / 45
Boosting Adaboost — prediction
38 / 45
Boosting Example: boosting
39 / 45
Boosting Example: data
40 / 45
Boosting Example: boosting
41 / 45
Boosting Example: bagging
42 / 45
Boosting Example: boosting vs bagging
43 / 45
Boosting Wrapping notes and Q&A
Advantages:
efficiently reduce bias, can use weak base models
work very well in practice
Disadvantages:
need to train base models sequentially
Questions?
44 / 45
Recap
We discussed:
bagging: ensembles of base models on bootstrapped versions of the data
random forest: bagging for trees with randomisation for selecting features for splitting
boosting: ensembles of models trained sequentially, later ones trying to correct/improve over previous ones.
Next week: we will discuss Gradient Boosting.
Thank you!
45 / 45