Lecture 6 Bagging and Boosting

1 / 45

Recap

In the last lecture, we discussed decision trees:

binary trees

how to perform prediction

how to grow a tree, with different criteria for regression and classification

Example: List of AI misuses https://github.com/daviddao/awful-ai

2 / 45

Outline

1 Bagging

2 Random Forest

3 Boosting

3 / 45

Bagging Key ideas

Previous lectures: we want to build a flexible model but this can lead to overfitting

the bias-variance trade-off: a simple model has high bias but low variance; a flexible model has low bias but high variance

Bagging: reduce variance without any notable increase in bias

build an ensemble of models (called base models)

predictions = average of predictions from the base models

analogy: wisdom of the crowd

we want the models to use the whole dataset, but predictions should be different.

we train the base models using slightly different versions of the dataset

the different versions are generated using bootstrapping

bagging is short for bootstrap aggregating

4 / 45
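The key ideas above can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not prescribe a base model, so a 1-nearest-neighbour regressor is used here as a stand-in for a flexible, high-variance model, and the names `fit_1nn`, `bagging_fit`, `bagging_predict` and the toy data are made up for the example.

```python
import random
from statistics import mean


def fit_1nn(data):
    """Hypothetical flexible base model: 1-nearest-neighbour regression in 1D."""
    def predict(x):
        # Return the target of the training point closest to x.
        return min(data, key=lambda point: abs(point[0] - x))[1]
    return predict


def bagging_fit(data, fit, n_models, seed=0):
    """Train n_models base models, each on its own bootstrap sample of data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) points uniformly with replacement.
        bootstrap = [rng.choice(data) for _ in range(len(data))]
        models.append(fit(bootstrap))
    return models


def bagging_predict(models, x):
    """Ensemble prediction = average of the base-model predictions."""
    return mean(model(x) for model in models)


# Toy data (made up for illustration): pairs (x, y).
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
models = bagging_fit(data, fit_1nn, n_models=25)
prediction = bagging_predict(models, 2.5)
```

Each base model sees the same dataset in a slightly different bootstrapped version, so the individual predictions differ and the average smooths them out.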

Bagging Why does it work?

Suppose there are three base classifiers, each with misclassification rate $\epsilon = 25\%$.

Assume the predictions made by different classifiers are independent.

With majority voting, the ensemble classifier makes a wrong prediction when two or more of the three classifiers are wrong. Writing $X$ for the number of wrong classifiers, the probability of this is

$$P(X \ge 2) = \sum_{i=2}^{3} \binom{3}{i} \epsilon^{i} (1-\epsilon)^{3-i} = 0.156,$$

which is smaller than the misclassification rate of an individual classifier, 0.25.

Usually, the more classifiers the better.

Note: in practice, the predictions are correlated (one model or similar models trained on pretty much the same dataset).
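The calculation above can be checked numerically. The sketch below assumes independent classifiers with a common error rate and an odd number of classifiers (so a majority vote never ties); the function name `majority_vote_error` is illustrative.

```python
from math import comb


def majority_vote_error(n_classifiers, eps):
    """P(more than half of n independent classifiers are wrong)."""
    threshold = n_classifiers // 2 + 1  # smallest number of wrong votes that loses
    return sum(
        comb(n_classifiers, i) * eps**i * (1 - eps) ** (n_classifiers - i)
        for i in range(threshold, n_classifiers + 1)
    )


error_3 = majority_vote_error(3, 0.25)    # the three-classifier case from the slide
error_11 = majority_vote_error(11, 0.25)  # more classifiers, same individual error
```

Under the independence assumption, `error_3` reproduces the 0.156 above and `error_11` is smaller again, in line with "the more classifiers the better". With correlated predictions the gain is smaller.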

5 / 45

Bagging Bootstrap

Bootstrapping generates versions of a dataset by sampling with replacement. A data point can appear more than once in the bootstrapped dataset.

For a dataset of size $n$, for each $i = 1, \dots, n$:

Sample $l_i$ uniformly from the set of integers $\{1, \dots, n\}$

Pick the point $(x_{l_i}, y_{l_i})$ and add it to the new dataset
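The two steps above translate directly into code. This is a minimal sketch using 0-based indices instead of $\{1, \dots, n\}$; the function name `bootstrap_sample` and the toy dataset are assumptions for the example.

```python
import random


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) points from dataset uniformly, with replacement."""
    n = len(dataset)
    sample = []
    for _ in range(n):
        l_i = rng.randrange(n)       # uniform on {0, ..., n-1}
        sample.append(dataset[l_i])  # the same point may be picked again
    return sample


rng = random.Random(0)                     # fixed seed, only for reproducibility
dataset = [(x, 2 * x) for x in range(10)]  # made-up (x, y) pairs
sample = bootstrap_sample(dataset, rng)
```

The bootstrapped dataset has the same size as the original, but some points typically appear more than once and others not at all.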

6 / 45

Bagging Bootstrap example: data

7 / 45

Bagging Bootstrap example: one bootstrap

8 / 45

Bagging Example: data

9 / 45

Bagging Example: ensemble members

10 / 45

Bagging Example: bagging vs training once on all data

11 / 45

Bagging Example: bagging performance

12 / 45

Bagging Wrapping notes and Q&A

Advantages:

almost always improves performance

easy to implement

compatible with any base model

works well in practice (many Kaggle competitions were won by ensemble methods)

Disadvantages:

need to train multiple models (could be expensive)

less interpretable compared to a single model

Questions?

13 / 45

Random forest Key ideas

We discussed bagging to construct models on different variants of the same dataset.

Random forest is bagging with decision trees as base models, plus random selection of a subset of features/inputs at each split.

This might seem counterintuitive, as each base model will be slightly worse. However, predictions from different trees will be less correlated, so the variance of the final ensemble will be smaller.

15 / 45

Random forest Key steps

For tree $b = 1$ to $B$:

1 Choose a bootstrap sample of size $N$ from the training set

2 Grow a tree $T_b$ on the bootstrapped data by recursively repeating the following steps for each leaf node of the tree, until a stopping criterion is met:

  1 Select $p$ variables at random from the $d$ variables ($p \le d$)

  2 Pick the best variable/split-point among the $p$ variables

  3 Split the node into two

Output the ensemble of trees $\{T_b\}_{b=1}^{B}$.

Prediction:

For regression:

$$F(x, \beta) = \frac{1}{B} \sum_{b=1}^{B} T_b(x).$$

For classification:

$$F(x, \beta) = \text{Majority Vote}\,\{T_b(x)\}_{b=1}^{B}.$$
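The two ingredients that distinguish a random forest from a single tree, the random choice of $p$ candidate variables at a split and the aggregation of the $B$ tree outputs, can be sketched as follows. Tree growing itself is omitted; the function names are illustrative, and breaking voting ties by first-seen order is an assumption, not something the slide specifies.

```python
import random
from collections import Counter


def sample_split_features(d, p, rng):
    """Pick p of the d feature indices at random, without replacement."""
    return sorted(rng.sample(range(d), p))


def predict_regression(tree_predictions):
    """Average of the individual tree predictions T_b(x)."""
    return sum(tree_predictions) / len(tree_predictions)


def predict_classification(tree_predictions):
    """Majority vote over the tree predictions (ties: first-seen label wins)."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
features = sample_split_features(d=10, p=3, rng=rng)  # candidates for one split
average = predict_regression([1.0, 2.0, 3.0])         # outputs of three trees
label = predict_classification(["a", "b", "a"])       # votes of three trees
```

A fresh feature subset is drawn for every split, so trees grown on similar bootstrap samples can still end up with quite different structures.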

16 / 45

Random forest Example: data

17 / 45

Random forest Example: bagging ensemble

18 / 45

Random forest Example: random forest ensemble

19 / 45

Random forest Example: random forest vs bagging

20 / 45

Random forest Wrapping notes and Q&A

Advantages:

similar to bagging, but usually better, since the trees are less correlated

Disadvantages:

similar to bagging

works less well on smaller training sets

Questions?

21 / 45

Boosting Key ideas

Bagging:

train multiple base models independently / in parallel

each model uses bootstrapped data

reduces variance without increasing bias by much

Boosting:

train multiple base models sequentially, one after another

each model is trained on reweighted data

“misclassified” points are upweighted and “correctly classified” points are downweighted

reduces bias

23 / 45

Concept of Boosting

Bagging aggregates individual weaker models by averaging or voting, and gives equal importance to each model.

Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification.

It works in a similar way to bagging, except that the models are grown sequentially, each using information from the previously grown models.

Boosting does not involve bootstrap datasets; each model is fit on a modified (reweighted) version of the original dataset.

24 / 45

Boosting Algorithm

In the context of classification, the boosting algorithm can be described as follows:

1 Train a number of weak classifiers. Each classifier can be very “weak”, e.g. a decision stump (a tree of depth 1)

2 Each new classifier should focus more on the data points that were incorrectly classified in the previous round: wrongly classified data points get higher weights

3 Combine the classifiers by letting them vote on the final prediction

  1 The classifiers are weighted when they are combined into a single powerful classifier

  2 Classifiers with low training misclassification rates get high weights

The final classifier is a weighted combination of the individual weak classifiers. Boosting is not limited to decision trees and can be used with many classifiers.
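The steps above can be sketched as a short loop. This is an illustration under several assumptions, not a definitive implementation: the weak classifiers come from a fixed pool of hypothetical decision stumps, the voting power is $\frac{1}{2}\log\frac{1-\epsilon}{\epsilon}$, and weights are updated by dividing by $2\epsilon$ (misclassified) or $2(1-\epsilon)$ (correct), which is the rule used in the worked example later in the lecture. The toy data and the names `adaboost` and `predict` are made up.

```python
import math


def predict(ensemble, x):
    """Sign of the weighted vote of the selected classifiers."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score > 0 else -1


def adaboost(X, y, candidates, max_rounds=10):
    """Select and weight classifiers from a fixed pool of candidates."""
    n = len(X)
    weights = [1.0 / n] * n  # equal weights to start
    ensemble = []            # list of (voting power, classifier)
    for _ in range(max_rounds):
        # Weighted misclassification rate of every candidate.
        errors = [
            sum(w for w, xi, yi in zip(weights, X, y) if h(xi) != yi)
            for h in candidates
        ]
        best = min(range(len(candidates)), key=errors.__getitem__)  # first wins ties
        eps, h = errors[best], candidates[best]
        if eps >= 0.5:
            break                # no useful classifier left
        if eps == 0:
            return [(1.0, h)]    # a perfect candidate: use it on its own
        ensemble.append((0.5 * math.log((1 - eps) / eps), h))
        if all(predict(ensemble, xi) == yi for xi, yi in zip(X, y)):
            break                # good enough on the training set
        # Reweight: misclassified points gain weight, the rest lose weight.
        weights = [
            w / (2 * eps) if h(xi) != yi else w / (2 * (1 - eps))
            for w, xi, yi in zip(weights, X, y)
        ]
    return ensemble


# Hypothetical 1D data: one negative case surrounded by positives.
X = [1, 2, 3, 4, 5]
y = [1, 1, -1, 1, 1]
stumps = [
    lambda x: 1 if x < 2.5 else -1,
    lambda x: 1 if x > 3.5 else -1,
    lambda x: 1 if x > 0.5 else -1,
]
ensemble = adaboost(X, y, stumps)
```

No single stump in this pool classifies all five points correctly, but the weighted vote of the three does.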

25 / 45

Boosting Adaboost — training

26 / 45

Example

Consider a training dataset $D = \{A(x_{11}, y_1), B(x_{21}, y_2), C(x_{31}, y_3), D(x_{41}, y_4), E(x_{51}, y_5)\}$, where the positive cases are A, B, D and E, while the negative case is C.

There are in total $N = 5$ cases and 1 feature ($d = 1$). See Figure 1.

Figure 1: Demo Dataset

27 / 45

Example - Algorithm

Suppose we have already produced 6 weak classifiers, each a decision stump with a single decision question.

Each weak classifier classifies a case by the rule “if the condition of the decision stump is satisfied, then predict 1 (positive), otherwise predict -1 (negative)”.

For example, consider the first classifier $x_1 < 1$ and the case A. The input $x_{11}$ of case A does not satisfy the condition $x_1 < 1$, hence the classifier classifies it as negative, which is wrong. See the figure on the previous slide.

28 / 45

Algorithm Loop 1

1.1 Initialise the weights of the training examples. Use equal weights, as there is no prior information. In this example

$$w_i = \frac{1}{N} = \frac{1}{5}, \quad i = 1, 2, 3, 4, 5.$$

1.2 Calculate the misclassification rate $\epsilon_k$ for each classifier $\hat{F}_k(x, \beta^{(k)})$. There are 6 classifiers in this example, so $k = 1, 2, 3, 4, 5, 6$.

1.3 Pick the $\hat{F}_k(x, \beta^{(k)})$ with the lowest misclassification rate, i.e. $\hat{F}_4(x, \beta^{(4)})$; see Figure 2.

Figure 2
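Steps 1.1 and 1.2 amount to summing the weights of the misclassified cases. The sketch below uses exact fractions. The predictions of $\hat{F}_4$ are not listed in the text; they are inferred here from the later statements that only case C is misclassified by it (its voting power $\frac{1}{2}\log(4)$ corresponds to $\epsilon = 1/5$), so treat them as an assumption.

```python
from fractions import Fraction


def weighted_error(weights, predictions, labels):
    """Weighted misclassification rate: total weight of the misclassified cases."""
    return sum(w for w, p, t in zip(weights, predictions, labels) if p != t)


labels = [1, 1, -1, 1, 1]         # cases A, B, C, D, E
weights = [Fraction(1, 5)] * 5    # step 1.1: equal weights 1/N
f4_predictions = [1, 1, 1, 1, 1]  # inferred: only C is misclassified
eps_4 = weighted_error(weights, f4_predictions, labels)
```

With equal weights this is simply the fraction of misclassified cases, here 1/5; in later loops the same function gives the weighted rate.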

29 / 45

Algorithm Loop 1

1.4 Calculate the voting power for the best classifier $\hat{F}_4(x, \beta^{(4)})$. The lower the misclassification rate, the higher the voting power. The algorithm defines the voting power using the natural logarithm; here, with $\epsilon_k = 1/5$,

$$\alpha_k = \frac{1}{2} \log\left(\frac{1 - \epsilon_k}{\epsilon_k}\right) = \frac{1}{2} \log(4),$$

and the current best classifier is

$$\frac{1}{2} \log(4)\, \hat{F}_4(x, \beta^{(4)}).$$

1.5 Check whether the stopping criteria are met. Stop if one of the following conditions is met:

The combined classifier $F(x, \beta)$ is good enough (at this loop, this is $\hat{F}_4(x, \beta^{(4)})$)

Enough iterations have been run (we don't wish to loop for too long)

No good classifier is left, e.g. the best remaining classifier has misclassification rate 0.5

Continue to the next step if no condition is met.

30 / 45

Algorithm Loop 1

1.6 Update the weights of the examples, so that those misclassified by the best classifier gain weight, using $\epsilon$, the misclassification rate of the best classifier:

$$w_i^{\text{new}} = \begin{cases} \dfrac{1}{2(1-\epsilon)}\, w_i^{\text{old}} & \text{if the case is correct} \\[2ex] \dfrac{1}{2\epsilon}\, w_i^{\text{old}} & \text{if the case is incorrect} \end{cases} \tag{1}$$

For example, case C is incorrectly classified by the current best classifier $\hat{F}_4(x, \beta^{(4)})$, so the new weight for case C is

$$w_3^{\text{new}} = \frac{1}{2\epsilon}\, w_3^{\text{old}} = \frac{1}{2 \times (1/5)} \times \frac{1}{5} = \frac{1}{2}.$$

The new weights for all five cases are shown in column 3 of Figure 3.

Figure 3

Note: the new weights satisfy $\sum_{\text{correct}} w_i = \sum_{\text{incorrect}} w_i = \frac{1}{2}$.
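The update rule (1) and the note above can be checked with exact fractions. This is a small sketch; the function name `update_weights` is illustrative, and the `correct` flags encode the statement that $\hat{F}_4$ misclassifies only case C.

```python
from fractions import Fraction


def update_weights(weights, correct, eps):
    """Apply rule (1): divide by 2(1 - eps) if correct, by 2 * eps if incorrect."""
    return [
        w / (2 * (1 - eps)) if ok else w / (2 * eps)
        for w, ok in zip(weights, correct)
    ]


old_weights = [Fraction(1, 5)] * 5          # cases A, B, C, D, E
correct = [True, True, False, True, True]   # only C is misclassified
new_weights = update_weights(old_weights, correct, eps=Fraction(1, 5))

total_correct = sum(w for w, ok in zip(new_weights, correct) if ok)
total_incorrect = sum(w for w, ok in zip(new_weights, correct) if not ok)
```

The result is 1/8 for A, B, D and E and 1/2 for C, so the correctly and incorrectly classified cases each carry total weight 1/2. This is why a classifier that has just been picked has weighted misclassification rate 1/2 in the next loop.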

31 / 45

Algorithm Loop 2

The classifier trained in Loop 1 is not good enough, so we go to Loop 2.

2.1 Use the new weights from Loop 1.

2.2 Again, calculate the misclassification rate $\epsilon_k$ for each classifier $\hat{F}_k(x, \beta^{(k)})$. The classifier picked in the last loop will have weighted $\epsilon_k = 1/2$.

2.3 Pick the $\hat{F}_k(x, \beta^{(k)})$ with the lowest misclassification rate. If there is a draw, pick the first one. In this case, it is $\hat{F}_2(x, \beta^{(2)})$.

Figure 4

32 / 45

Algorithm Loop 2

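2.4 Calculate the voting power for the best classifier $\hat{F}_2(x, \beta^{(2)})$:

$$\alpha_k = \frac{1}{2} \log\left(\frac{1 - \epsilon_k}{\epsilon_k}\right) = \frac{1}{2} \log(3),$$

and construct the current best classifier

$$\frac{1}{2} \log(4)\, \hat{F}_4(x, \beta^{(4)}) + \frac{1}{2} \log(3)\, \hat{F}_2(x, \beta^{(2)}).$$

The result is shown in Figure 5.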

Figure 5

33 / 45

Algorithm Loop 2

2.5 Check whether the stopping criteria are met. If yes, stop; otherwise go to step 2.6.

2.6 Update the weights of the examples using (1) again. For example, this time case C has been correctly classified (by classifier 2), so its new weight is

$$w_3^{\text{new}} = \frac{1}{2(1 - \epsilon)}\, w_3^{\text{old}} = \frac{1}{2(1 - 2/8)} \times \frac{4}{8} = \frac{4}{12}.$$

The updated weights are shown in Figure 6.

Figure 6

34 / 45

Algorithm Loop 3

The classifier trained in Loop 2 is not good enough, so we go to Loop 3.

3.1 Use the new weights from Loop 2.

3.2 Again, calculate the misclassification rate $\epsilon_k$ for each classifier $\hat{F}_k(x, \beta^{(k)})$.

3.3 Pick the $\hat{F}_k(x, \beta^{(k)})$ with the lowest misclassification rate. In this case, it is $\hat{F}_6(x, \beta^{(6)})$. See Figure 7.

Figure 7

35 / 45

Algorithm Loop 3

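3.4 Calculate the voting power for the best classifier $\hat{F}_6(x, \beta^{(6)})$:

$$\alpha_k = \frac{1}{2} \log\left(\frac{1 - \epsilon_k}{\epsilon_k}\right) = \frac{1}{2} \log(5),$$

and construct the current best classifier

$$\frac{1}{2} \log(4)\, \hat{F}_4(x, \beta^{(4)}) + \frac{1}{2} \log(3)\, \hat{F}_2(x, \beta^{(2)}) + \frac{1}{2} \log(5)\, \hat{F}_6(x, \beta^{(6)}).$$

The result is shown in Figure 8.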

Figure 8
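The three voting powers from Loops 1 to 3 all come from the same formula. In the sketch below, the rates 1/5 and 2/8 are the ones that appear in the weight updates of Loops 1 and 2; the rate 1/6 for Loop 3 is not stated in the text and is inferred from $\frac{1}{2}\log(5)$. The function name `voting_power` is illustrative.

```python
import math


def voting_power(eps):
    """alpha = 0.5 * log((1 - eps) / eps), using the natural logarithm."""
    return 0.5 * math.log((1 - eps) / eps)


alpha_4 = voting_power(1 / 5)  # Loop 1, classifier 4
alpha_2 = voting_power(2 / 8)  # Loop 2, classifier 2
alpha_6 = voting_power(1 / 6)  # Loop 3, classifier 6 (rate inferred)
no_vote = voting_power(0.5)    # a classifier no better than chance
```

A lower weighted misclassification rate gives a larger voting power, and a rate of exactly 0.5 gives a voting power of zero, which matches the stopping condition of Loop 1.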

36 / 45

Algorithm Loop 4

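We could continue with Loop 4, but the current best classifier

$$F(x, \beta) = \operatorname{sgn}\left( \frac{1}{2} \log(4)\, \hat{F}_4(x, \beta^{(4)}) + \frac{1}{2} \log(3)\, \hat{F}_2(x, \beta^{(2)}) + \frac{1}{2} \log(5)\, \hat{F}_6(x, \beta^{(6)}) \right)$$

can classify all training examples correctly. So we can stop training.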
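The final classifier only needs the three stump outputs for a case. The sketch below takes those outputs as given, since the stump conditions themselves are only shown in the figures; the name `combined_classifier` is illustrative.

```python
import math
from itertools import product

# Voting powers of F4, F2 and F6, in that order.
ALPHAS = (0.5 * math.log(4), 0.5 * math.log(3), 0.5 * math.log(5))


def combined_classifier(votes):
    """votes: the +1/-1 outputs of (F4, F2, F6) for one case."""
    score = sum(alpha * vote for alpha, vote in zip(ALPHAS, votes))
    return 1 if score > 0 else -1


# Compare the weighted vote with a plain majority vote on all 8 vote patterns.
agrees_with_majority = all(
    combined_classifier(votes) == (1 if sum(votes) > 0 else -1)
    for votes in product((1, -1), repeat=3)
)
```

With these particular voting powers, any two classifiers outvote the third, so the weighted vote coincides with a simple majority vote here. That need not hold in general: one classifier with a large enough voting power can overrule the others.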

Finally, as an exercise, use the above classifier to classify the new case F shown in Figure 9.

Figure 9

37 / 45

Boosting Adaboost — prediction

38 / 45

Boosting Example: boosting

39 / 45

Boosting Example: data

40 / 45

Boosting Example: boosting

41 / 45

Boosting Example: bagging

42 / 45

Boosting Example: boosting vs bagging

43 / 45

Boosting Wrapping notes and Q&A

Advantages:

efficiently reduce bias, can use weak base models

work very well in practice

Disadvantages:

need to train base models sequentially

Questions?

44 / 45

Recap

We discussed:

bagging: ensembles of base models on bootstrapped versions of the data

random forest: bagging for trees with randomisation for selecting features for splitting

boosting: ensembles of models trained sequentially, later ones trying to correct/improve over previous ones.

Next week: we will discuss Gradient Boosting.

Thank you!

45 / 45
