Lecture 7 Gradient Boosting


Recap

In the last lecture, we discussed:

bagging: ensembles of base models on bootstrapped versions of the data

random forest: bagging for trees with randomisation for selecting features for splitting

boosting: ensembles of models trained sequentially, later ones trying to correct/improve over previous ones. Adaboost reweights data points so that misclassified points get higher weights.

Example: Combining decision trees and logistic regression – Practical Lessons from Predicting Clicks on Ads at Facebook (He et al.), https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/.

Outline

1 Review of Bagging and Boosting

2 Gradient boosting for regression

3 Gradient boosting for multiclass classification

4 A bit of history and a comparison

5 Gradient tree boosting and XGBoost


A quick review of Bagging and Boosting

https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/


Set up and idea

Consider a regression task given the training set $\{(x_n, y_n)\}_{n=1}^N$.

A natural objective function for this task is the squared loss.

Suppose we already have a model $f(x)$ but we are not happy with it.

Question: How can we improve the current model $f(x)$?

Key idea: Look at the differences (aka the residuals) $y - f(x)$ and try to find a function $h(x)$ to fit these.

That is, we want to keep $f(x)$ fixed, and find $h(x)$ such that $f(x) + h(x)$ approximately matches $y$.

Practical implementation

Given the data $\{(x_n, y_n)\}_{n=1}^N$ and the previous fit $f(x)$:

We first compute the residuals $\{(x_n, y_n - F_n)\}_{n=1}^N$, where $F_n = f(x_n)$.

We then fit another regression model $h(x)$ (linear, decision tree, ...) to the residuals.

If $f(x) + h(x)$ is still not good enough, treat it as the new $f(x)$ and repeat.
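One round of this residual-fitting procedure can be sketched in plain Python. This is only an illustration under stated assumptions: the toy data, the constant starting model `f`, and the helper names `fit_line` and `squared_loss` are made up here and do not come from the lecture.

```python
# Toy data (made up for illustration): roughly y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

def f(x):
    """Current (poor) model: a constant equal to the mean of y."""
    return sum(ys) / len(ys)

def fit_line(inputs, targets):
    """Least-squares straight line through (inputs, targets)."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_t = sum(targets) / n
    slope = (sum((x - mean_x) * (t - mean_t) for x, t in zip(inputs, targets))
             / sum((x - mean_x) ** 2 for x in inputs))
    intercept = mean_t - slope * mean_x
    return lambda x: intercept + slope * x

def squared_loss(model):
    """Half the sum of squared errors of a model on the toy data."""
    return 0.5 * sum((y - model(x)) ** 2 for x, y in zip(xs, ys))

residuals = [y - f(x) for x, y in zip(xs, ys)]  # y_n - F_n
h = fit_line(xs, residuals)                     # h(x) fitted to the residuals

def f_new(x):
    """Keep f fixed and add the new fit h."""
    return f(x) + h(x)

loss_before = squared_loss(f)
loss_after = squared_loss(f_new)
```

Here `loss_after` is much smaller than `loss_before`, because the line fitted to the residuals captures the trend that the constant model misses.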

Relationship to gradient descent

Consider the loss function

$$L(y, f(x)) = \frac{1}{2} \sum_{n=1}^{N} (y_n - F_n)^2$$

If we treat the actual function values $\{F_n\}_{n=1}^N$ as parameters, then

$$\frac{\partial L}{\partial F_n} = F_n - y_n$$

That is, the residuals are exactly the negative gradients!

Thus the new function value $F_n$ after fitting $h(x)$ is:

$$\begin{aligned}
F_n = f(x_n) &\leftarrow F_n + h(x_n) && \text{(new func = prev func + new fit)} \\
&\approx F_n + (y_n - F_n) && \text{(new fit approximates previous residuals)} \\
&= F_n - \frac{\partial L}{\partial F_n} && \text{(residuals = negative gradients)}
\end{aligned}$$

i.e. we have done approximate gradient descent with step size 1!
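The claim that the residuals are the negative gradients can be checked numerically. The sketch below compares a central finite difference of the squared loss with $F_n - y_n$; the values of `y` and `F` are arbitrary numbers chosen for illustration.

```python
def loss(F, y):
    """Squared loss L = 1/2 * sum_n (y_n - F_n)^2, with the F_n as parameters."""
    return 0.5 * sum((yn - Fn) ** 2 for yn, Fn in zip(y, F))

y = [3.0, -1.0, 2.5]   # targets (illustrative values)
F = [2.0, 0.5, 2.5]    # current function values F_n = f(x_n)

eps = 1e-6
numeric = []
for n in range(len(F)):
    up = list(F)
    down = list(F)
    up[n] += eps
    down[n] -= eps
    # Central difference approximation of dL/dF_n.
    numeric.append((loss(up, y) - loss(down, y)) / (2 * eps))

analytic = [Fn - yn for Fn, yn in zip(F, y)]            # dL/dF_n = F_n - y_n
one_step = [Fn - gn for Fn, gn in zip(F, analytic)]     # gradient step, step size 1
```

With step size 1 and an exact fit to the residuals, `one_step` lands on `y` itself; in practice $h(x)$ only approximates the residuals, which is why the update is only approximate gradient descent.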

Gradient boosting for regression with the squared loss

Start with an initial model, for example, a constant model $f(x) \equiv \frac{1}{N} \sum_{n=1}^{N} y_n$.

Do the following steps until some criteria are satisfied:

Calculate negative gradients for $n = 1, 2, \dots, N$:

$$-g(x_n) := -\frac{\partial L}{\partial F_n} = y_n - F_n$$

Fit a model $h(x)$ to the negative gradients $\{-g(x_n)\}_{n=1}^N$ such that

$$h(x_n) \approx -g(x_n)$$

Note this is a regression problem.

Construct the new model as

$$f^{\text{new}}(x) := f(x) + \rho h(x)$$

with a learning rate $\rho = 1$. This rate can be optimised.
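The loop above can be sketched in pure Python with depth-1 regression trees (stumps) as the base model. The choice of stumps, the one-dimensional toy data, and the names `fit_stump` and `gradient_boost` are assumptions made for this illustration, not part of the lecture.

```python
def fit_stump(xs, targets):
    """Depth-1 regression tree on a 1-D input: best threshold by squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = (sum((t - mean_l) ** 2 for t in left)
               + sum((t - mean_r) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, mean_l, mean_r)
    _, thr, mean_l, mean_r = best
    return lambda x: mean_l if x <= thr else mean_r

def gradient_boost(xs, ys, n_rounds=20, rho=1.0):
    """Gradient boosting for the squared loss with stumps as base models."""
    f0 = sum(ys) / len(ys)                  # initial constant model
    F = [f0] * len(xs)                      # current values F_n = f(x_n)
    stumps = []
    for _ in range(n_rounds):
        neg_grad = [y - Fn for y, Fn in zip(ys, F)]      # -g(x_n) = y_n - F_n
        h = fit_stump(xs, neg_grad)                      # regression on -g
        stumps.append(h)
        F = [Fn + rho * h(x) for Fn, x in zip(F, xs)]    # f_new = f + rho * h
    return lambda x: f0 + rho * sum(h(x) for h in stumps)

# Toy data (illustrative): y = x^2 on eight points.
xs = [float(i) for i in range(8)]
ys = [x * x for x in xs]
model = gradient_boost(xs, ys)
```

Each round fits a stump to the current residuals, so the training loss of `model` cannot increase from one round to the next when $\rho = 1$.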

Generalisation to arbitrary differentiable loss function

Note that the only items specific to squared-loss regression are the loss and its gradient, i.e. the line $-g(x_n) = y_n - F_n$ on the previous slide. We can replace this line with the gradient of any differentiable loss function.

Start with an initial model $f(x)$.

Do the following steps until some criteria are satisfied:

Calculate negative gradients for $n = 1, 2, \dots, N$:

$$-g(x_n) := -\frac{\partial L}{\partial F_n}$$

Fit a model $h(x)$ to the negative gradients $\{-g(x_n)\}_{n=1}^N$ such that

$$h(x_n) \approx -g(x_n)$$

Note this is a regression problem.

Construct the new model as

$$f^{\text{new}}(x) := f(x) + \rho h(x)$$

with a learning rate $\rho = 1$.
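To illustrate that only the gradient line changes, the sketch below passes the negative gradient in as a function and swaps the squared loss for the Huber loss. The Huber loss, the stump base model, the learning rate 0.5, and the toy data with one outlier are all choices made for this example rather than anything prescribed by the lecture.

```python
def fit_stump(xs, targets):
    """Depth-1 regression tree on a 1-D input: best threshold by squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = (sum((t - mean_l) ** 2 for t in left)
               + sum((t - mean_r) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, mean_l, mean_r)
    _, thr, mean_l, mean_r = best
    return lambda x: mean_l if x <= thr else mean_r

def boost(xs, ys, neg_gradient, n_rounds=30, rho=0.5):
    """Generic gradient boosting: the loss enters only through neg_gradient."""
    f0 = sum(ys) / len(ys)
    F = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        targets = [neg_gradient(y, Fn) for y, Fn in zip(ys, F)]
        h = fit_stump(xs, targets)
        stumps.append(h)
        F = [Fn + rho * h(x) for Fn, x in zip(F, xs)]
    return lambda x: f0 + rho * sum(h(x) for h in stumps)

def squared_neg_gradient(y, F):
    """Squared loss: the negative gradient is the residual."""
    return y - F

def huber_neg_gradient(y, F, delta=1.0):
    """Huber loss: the residual, clipped to [-delta, delta]."""
    r = y - F
    return r if abs(r) <= delta else delta * (1.0 if r > 0 else -1.0)

def huber_loss(y, F, delta=1.0):
    r = abs(y - F)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

# Toy data (illustrative) with one outlier at the last point.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 5.0, 5.0, 50.0]
huber_model = boost(xs, ys, huber_neg_gradient)
squared_model = boost(xs, ys, squared_neg_gradient)
```

Only the `neg_gradient` argument differs between the two calls; the fitting and update steps are shared.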


Gradient boosting: generalisation to arbitrary differentiable loss function

Recall the generic procedure: start with an initial model $f(x)$, then repeat until some criteria are satisfied: calculate the negative gradients $-g(x_n) := -\partial L / \partial F_n$ for $n = 1, 2, \dots, N$; fit a regression model $h(x)$ such that $h(x_n) \approx -g(x_n)$; and construct the new model $f^{\text{new}}(x) := f(x) + \rho h(x)$ with a learning rate $\rho = 1$.

To apply this to multiclass classification, we need to define $f(x)$ and $g(x_n)$.

Multiclass classification [Lecture 3]: Set up

Multiclass classification: multiple potential outcomes (1/2/3/.../K, or positive/neutral/negative).

We construct a classifier that learns the class probabilities:

$p(y = k \mid x)$: probability of class $k$ given input $x$

For multiclass classification, $k = 1$ or $k = 2$ or ... or $k = K$, and by the laws of probability:

$$p(y = 1 \mid x) + p(y = 2 \mid x) + \dots + p(y = K \mid x) = 1$$

Write $p(y = 1 \mid x) = P_1(x)$, $p(y = 2 \mid x) = P_2(x)$, ..., $p(y = K \mid x) = P_K(x)$. We need:

$$0 \le P_k(x) \le 1 \quad \forall x, k, \qquad P_1(x) + P_2(x) + \dots + P_K(x) = 1$$

Multiclass classification [Lecture 3]: Softmax

We define $K$ functions, $\{f_k(x; \beta)\}_{k=1}^K$.

Reminder: we want $\{P_k(x; \beta)\}_{k=1}^K$ such that $0 \le P_k(x; \beta) \le 1$ for all $x, k$ and $\sum_k P_k(x; \beta) = 1$.

The idea is to "squash" $\{f_k(x; \beta)\}_{k=1}^K$ onto the $(K-1)$-simplex:

$$P_k(x; \beta) = \text{softmax}(f_k(x; \beta)) = \frac{\exp(f_k(x; \beta))}{\sum_{i=1}^{K} \exp(f_i(x; \beta))}$$
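A minimal softmax in plain Python, as a sketch of the formula above. Subtracting the maximum score before exponentiating is a common numerical safeguard added here; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(scores):
    """Map K real scores f_1..f_K to probabilities P_1..P_K."""
    shift = max(scores)                          # avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

probs = softmax([2.0, 1.0, 0.1])     # illustrative scores
uniform = softmax([0.0, 0.0, 0.0])   # equal scores give equal probabilities
```

The outputs are non-negative and sum to one, so they satisfy both requirements on $P_k$.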

Multiclass classification [Lecture 3]: Objective function

Note we can write the likelihood:

$$p(y = k \mid x, \beta) = P_k(x; \beta)$$

Similar to the binary classification case, we want to minimise the negative log-likelihood:

$$L(\beta) = -\sum_{n=1}^{N} \log(P_{y_n}(x_n; \beta))$$

which is often referred to as the multi-class cross-entropy loss.

If one-hot coding is used, $t_n = (t_{n1}, t_{n2}, \dots, t_{nK})$ where $t_{nk} = 1$ if $y_n = k$ and $0$ otherwise, the loss function above can be rewritten as

$$L(\beta) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \log(P_k(x_n; \beta))$$
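The two forms of the loss give the same number, which the following sketch checks on made-up probabilities. Classes are indexed from 0 in the code, whereas the slides number them from 1; the function names are illustrative.

```python
import math

def nll_from_labels(P, y):
    """L = -sum_n log P_{y_n}(x_n), with P[n][k] the class probabilities."""
    return -sum(math.log(P[n][y[n]]) for n in range(len(y)))

def nll_from_one_hot(P, T):
    """L = -sum_n sum_k t_nk log P_k(x_n), with T[n] the one-hot target."""
    return -sum(T[n][k] * math.log(P[n][k])
                for n in range(len(T))
                for k in range(len(T[n])))

# Illustrative predicted probabilities for N = 2 points and K = 3 classes.
P = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1]]
y = [0, 1]                      # class labels
T = [[1, 0, 0], [0, 1, 0]]      # the same labels, one-hot coded

loss_labels = nll_from_labels(P, y)
loss_one_hot = nll_from_one_hot(P, T)
```

The one-hot sum simply selects the term for the true class, since every other $t_{nk}$ is zero.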

Back to gradient boosting. Multiclass: functions, loss and gradients

Instead of one function as in regression, we will have $K$ functions, $f_k(x)$.

We have the loss function

$$L = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \log(P_k(x_n))$$

where $P_k(x_n)$ is the softmax of the function values $f_1(x_n), \dots, f_K(x_n)$.

So we can derive the negative gradients with respect to the function values $\{f_k(x_n)\}$:

$$-\frac{\partial L}{\partial f_k(x_n)} = t_{nk} - P_k(x_n)$$
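The gradient formula follows from the softmax definition in a few lines. The derivation below is a sketch that uses the Kronecker delta $\delta_{jk}$ (a symbol not used elsewhere in these slides) and the fact that the one-hot targets sum to one, $\sum_j t_{nj} = 1$.

```latex
\begin{aligned}
\log P_j(x_n) &= f_j(x_n) - \log \sum_{i=1}^{K} \exp(f_i(x_n)) \\
\frac{\partial \log P_j(x_n)}{\partial f_k(x_n)} &= \delta_{jk} - P_k(x_n) \\
-\frac{\partial L}{\partial f_k(x_n)}
  &= \sum_{j=1}^{K} t_{nj} \bigl( \delta_{jk} - P_k(x_n) \bigr)
   = t_{nk} - P_k(x_n) \sum_{j=1}^{K} t_{nj}
   = t_{nk} - P_k(x_n)
\end{aligned}
```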

Gradient boosting for multiclass classification: a summary

Start with $K$ initial functions, $\{f_k(x) \equiv 0\}_{k=1}^K$.

Do the following steps until some criteria are satisfied:

Calculate negative gradients for $n = 1, 2, \dots, N$ and $k = 1, 2, \dots, K$, writing $F_{nk} = f_k(x_n)$:

$$-g_k(x_n) := -\frac{\partial L}{\partial F_{nk}}$$

Fit $K$ models $\{h_k(x)\}_{k=1}^K$ to the negative gradients $\{-g_k(x_n)\}_{n,k}$ such that

$$h_k(x_n) \approx -g_k(x_n)$$

Note that each of these is a regression problem.

Construct the new models as

$$f_k^{\text{new}}(x) := f_k(x) + \rho h_k(x)$$

with learning rate $\rho$.
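The multiclass procedure can be sketched in pure Python as follows. The depth-1 trees, the tiny two-feature data set, and the names `fit_stump` and `boost_multiclass` are assumptions made for illustration; classes are indexed from 0 in the code.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def fit_stump(X, targets):
    """Depth-1 regression tree: best (feature, threshold) by squared error."""
    best = None
    for d in range(len(X[0])):
        values = sorted(set(row[d] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, targets) if row[d] <= thr]
            right = [t for row, t in zip(X, targets) if row[d] > thr]
            mean_l = sum(left) / len(left)
            mean_r = sum(right) / len(right)
            sse = (sum((t - mean_l) ** 2 for t in left)
                   + sum((t - mean_r) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, d, thr, mean_l, mean_r)
    _, d, thr, mean_l, mean_r = best
    return lambda row: mean_l if row[d] <= thr else mean_r

def boost_multiclass(X, labels, K, n_rounds=2, rho=1.0):
    N = len(X)
    F = [[0.0] * K for _ in range(N)]     # F[n][k] = f_k(x_n), starting at 0
    rounds = []                           # rounds[m][k] = h_k from round m
    for _ in range(n_rounds):
        P = [softmax(F[n]) for n in range(N)]
        hs = []
        for k in range(K):
            # Negative gradient t_nk - P_k(x_n) for class k.
            neg_grad = [(1.0 if labels[n] == k else 0.0) - P[n][k]
                        for n in range(N)]
            hs.append(fit_stump(X, neg_grad))
        for n in range(N):
            for k in range(K):
                F[n][k] += rho * hs[k](X[n])
        rounds.append(hs)

    def predict_proba(row):
        return softmax([rho * sum(hs[k](row) for hs in rounds)
                        for k in range(K)])
    return predict_proba

# Toy data (illustrative): three well-separated classes, two features.
X = [[0.1, 1.0], [0.2, 1.2], [1.0, 0.1], [1.1, 0.2], [2.0, 2.0], [2.1, 2.2]]
labels = [0, 0, 1, 1, 2, 2]
predict = boost_multiclass(X, labels, K=3)
```

In the first round every probability is $1/3$, so the targets for each stump are $2/3$ for the true class and $-1/3$ otherwise.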

Classification Example (K = 3)

Step 1: Initial models. With $f_k(x) \equiv 0$ for $k = 1, 2, 3$, every point starts with $P_k(x_n) = 1/3$.

Step 2: Calculate the negative gradients $-g_k(x_n) = t_{nk} - P_k(x_n)$, which are $0.667$ for the true class and $-0.333$ otherwise.

Step 3a: Modelling from $X$ to $-g_1$ [Model 1]

Built a decision tree $h_1(x)$ of depth 1 from all $X$ to $-g_1$:

$$h_1(x) = \begin{cases} 0.667 & \text{if } x_4 \le 0.65 \\ -0.333 & \text{if } x_4 > 0.65 \end{cases}$$

Step 3b: Modelling from $X$ to $-g_2$ [Model 2]

Built a decision tree $h_2(x)$ of depth 1 from all $X$ to $-g_2$:

$$h_2(x) = \begin{cases} 0.667 & \text{if } x_2 \le 2.95 \\ -0.222 & \text{if } x_2 > 2.95 \end{cases}$$

Step 3c: Modelling from $X$ to $-g_3$ [Model 3]

Built a decision tree $h_3(x)$ of depth 1 from all $X$ to $-g_3$:

$$h_3(x) = \begin{cases} -0.333 & \text{if } x_4 \le 1.70 \\ 0.667 & \text{if } x_4 > 1.70 \end{cases}$$

Step 4: Update the three models, $f_k(x) := f_k(x) + \rho h_k(x)$, setting $\rho = 1$ for simplicity:

$$f_1(x) := f_1(x) + h_1(x) = 0 + h_1(x) = \begin{cases} 0.667 & \text{if } x_4 \le 0.65 \\ -0.333 & \text{if } x_4 > 0.65 \end{cases}$$

$$f_2(x) := f_2(x) + h_2(x) = 0 + h_2(x) = \begin{cases} 0.667 & \text{if } x_2 \le 2.95 \\ -0.222 & \text{if } x_2 > 2.95 \end{cases}$$

$$f_3(x) := f_3(x) + h_3(x) = 0 + h_3(x) = \begin{cases} -0.333 & \text{if } x_4 \le 1.70 \\ 0.667 & \text{if } x_4 > 1.70 \end{cases}$$

Step 5: Update the $P$s from the new $F$s.

Step 6: Find the negative gradients again.

Step 7: Modelling the negative gradients again, we have the second set of base models $h_1(x)$, $h_2(x)$ and $h_3(x)$:

$$h_1(x) = \begin{cases} 0.437 & \text{if } x_4 \le 0.65 \\ -0.225 & \text{if } x_4 > 0.65 \end{cases}$$

$$h_2(x) = \begin{cases} 0.278 & \text{if } x_2 \le 2.95 \\ -0.076 & \text{if } x_2 > 2.95 \end{cases}$$

$$h_3(x) = \begin{cases} -0.155 & \text{if } x_4 \le 1.70 \\ 0.296 & \text{if } x_4 > 1.70 \end{cases}$$

Note: the splitting points for this set of base models are the same as for the first set.

Step 8: Final models (suppose we stop here):

$$f_1(x) := f_1(x) + h_1(x) = \begin{cases} 1.104 & \text{if } x_4 \le 0.65 \\ -0.558 & \text{if } x_4 > 0.65 \end{cases}$$

$$f_2(x) := f_2(x) + h_2(x) = \begin{cases} 0.945 & \text{if } x_2 \le 2.95 \\ -0.298 & \text{if } x_2 > 2.95 \end{cases}$$

$$f_3(x) := f_3(x) + h_3(x) = \begin{cases} -0.488 & \text{if } x_4 \le 1.70 \\ 0.963 & \text{if } x_4 > 1.70 \end{cases}$$

Step 9: Prediction on $x^* = [4.7, 3.2, 1.3, 0.2]^T$. For this point we have

$$F_1(x^*) = 1.104; \quad F_2(x^*) = -0.298; \quad F_3(x^*) = -0.488$$

Hence the probabilities are

$$P_1(x^*) = \frac{e^{F_1(x^*)}}{e^{F_1(x^*)} + e^{F_2(x^*)} + e^{F_3(x^*)}} = 0.690$$

$$P_2(x^*) = \frac{e^{F_2(x^*)}}{e^{F_1(x^*)} + e^{F_2(x^*)} + e^{F_3(x^*)}} = 0.170$$

$$P_3(x^*) = \frac{e^{F_3(x^*)}}{e^{F_1(x^*)} + e^{F_2(x^*)} + e^{F_3(x^*)}} = 0.140$$

Now this case can be classified as Class 1.
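The prediction step can be reproduced with a few lines of Python. The leaf values and thresholds are copied from the final models in Step 8; the only conventions added here are the 0-based list indices (so $x_4$ is `x[3]` and $x_2$ is `x[1]`).

```python
import math

# Final models from Step 8 of the example.
def f1(x):
    return 1.104 if x[3] <= 0.65 else -0.558

def f2(x):
    return 0.945 if x[1] <= 2.95 else -0.298

def f3(x):
    return -0.488 if x[3] <= 1.70 else 0.963

x_star = [4.7, 3.2, 1.3, 0.2]
scores = [f1(x_star), f2(x_star), f3(x_star)]    # F_1, F_2, F_3 at x*

exps = [math.exp(s) for s in scores]
probs = [v / sum(exps) for v in exps]            # softmax of the scores
predicted_class = probs.index(max(probs)) + 1    # classes numbered from 1
```

Rounded to three decimals, `probs` matches the values 0.690, 0.170 and 0.140 above, and `predicted_class` is 1.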


History of boosting and gradient boosting

AdaBoost was invented by Freund et al. (1996), Freund and Schapire (1997).

Breiman et al. (1998), Breiman (1999) interpreted AdaBoost as a gradient descent algorithm under a special loss function.

Friedman et al. (2000), Friedman (2001) proposed gradient boosting for any differentiable loss function.

AdaBoost trains models sequentially, reweights data points, and can be interpreted as gradient descent with the exponential loss function.

Gradient boosting trains models sequentially, considers the negative gradients/residuals, and can be used with any differentiable loss function.

Some maths: Gradient Boosting*

Let $F^{(m-1)}(x_n)$ be the prediction for the $n$-th case at the $(m-1)$-th iteration. At the $m$-th iteration, we need to add $h_m$ to minimise the following objective:

$$L^{(m)} = \sum_{n=1}^{N} L(t_n, F^{(m-1)}(x_n) + h_m(x_n)) + \Omega(h_m)$$

Using a second-order Taylor approximation, one has

$$L^{(m)} \approx \sum_{n=1}^{N} \left[ L(t_n, F^{(m-1)}(x_n)) + g(x_n) h_m(x_n) + \frac{1}{2} e(x_n) h_m^2(x_n) \right] + \Omega(h_m)$$

where

$$g(x_n) = \frac{\partial L(t_n, F^{(m-1)}(x_n))}{\partial F^{(m-1)}(x_n)}, \qquad e(x_n) = \frac{\partial^2 L(t_n, F^{(m-1)}(x_n))}{\partial F^{(m-1)}(x_n)\, \partial F^{(m-1)}(x_n)}$$

Some more maths: Gradient Boosting*

We write the objective as

$$L^{(m)} \approx \sum_{n=1}^{N} \frac{1}{2} e(x_n) \left[ h_m(x_n) - \left( -\frac{g(x_n)}{e(x_n)} \right) \right]^2 + \Omega(h_m) + \text{const.}$$

Up to the regulariser $\Omega(h_m)$, minimising $L^{(m)}$ is equivalent to fitting $h_m$ to $-\frac{g(x_n)}{e(x_n)}$ by least squares, with each case weighted by $e(x_n)$.

For example, for regression with the squared error loss function, we have $e(x_n) = 1$. That is why we fit $h_m$ to the negative gradient $-g(x_n)$.

For the cross-entropy loss function, we no longer have $e(x_n) = 1$, but we could still try to fit $h_m$ to the negative gradient $-g(x_n)$.
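The rewritten objective comes from completing the square in each term of the Taylor expansion. The sketch below drops the argument $x_n$ and the subscript $m$ for readability and assumes $e > 0$; the last line states the standard second derivative of the softmax cross-entropy, which is why $e(x_n) \neq 1$ in that case.

```latex
\begin{aligned}
g\,h + \tfrac{1}{2}\, e\, h^2
  &= \tfrac{1}{2}\, e \left( h^2 + \tfrac{2g}{e}\, h \right) \\
  &= \tfrac{1}{2}\, e \left( h + \tfrac{g}{e} \right)^2 - \tfrac{g^2}{2e} \\
  &= \tfrac{1}{2}\, e \left[ h - \left( -\tfrac{g}{e} \right) \right]^2 + \text{const.} \\[6pt]
\text{Cross-entropy: } \quad
\frac{\partial^2 L}{\partial f_k(x_n)^2} &= P_k(x_n)\,\bigl(1 - P_k(x_n)\bigr)
\end{aligned}
```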


Gradient Tree Boosting

Using a tree model $h_m$ in the $m$-th iteration, Gradient Tree Boosting takes two steps:

Step 1: Fit the tree $h_m$ to $-\frac{g(x_n)}{e(x_n)}$. Keep the tree structure, i.e., the partition of the input space into $J$ regions $R_1, R_2, \dots, R_J$.

Step 2: The tree can be expressed as

$$h_m(x) = \sum_{j=1}^{J} \beta_j \mathbb{1}(x, R_j) \quad \text{where} \quad \mathbb{1}(x, R_j) = \begin{cases} 1 & \text{if } x \in R_j \\ 0 & \text{otherwise} \end{cases}$$

Optimise, with respect to all $\beta_j$, the loss

$$L^{(m)} \approx \sum_{n=1}^{N} \left[ g(x_n) h_m(x_n) + \frac{1}{2} e(x_n) h_m^2(x_n) \right] + \frac{1}{2} \lambda \sum_{j=1}^{J} \beta_j^2 + \gamma J$$

Consider the model we are working on:

$$h_m(x) = \sum_{j=1}^{J} \beta_j \mathbb{1}(x, R_j)$$

Note that if the case $x_n$ falls in the region $R_1$, then

$$h_m(x_n) = \beta_1 \quad \text{and} \quad h_m^2(x_n) = \beta_1^2$$

Let us denote

$$I_j = \{ n : x_n \in R_j \}$$

Then we have

$$L^{(m)} \approx \sum_{j=1}^{J} \left[ \left( \sum_{n \in I_j} g(x_n) \right) \beta_j + \frac{1}{2} \left( \sum_{n \in I_j} e(x_n) + \lambda \right) \beta_j^2 \right] + \gamma J$$

The new $L^{(m)}$ can be optimised with respect to each $\beta_j$ individually:

$$\min_{\beta_j} \left( \sum_{n \in I_j} g(x_n) \right) \beta_j + \frac{1}{2} \left( \sum_{n \in I_j} e(x_n) + \lambda \right) \beta_j^2$$

The best solution for $\beta_j$ is given by

$$\beta_j^* = -\frac{\sum_{n \in I_j} g(x_n)}{\sum_{n \in I_j} e(x_n) + \lambda}$$
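The closed-form leaf weight can be checked with a short sketch. The gradient values below are made up; with the squared loss ($e = 1$) and $\lambda = 0$ the optimal weight is simply the mean residual in the leaf, and $\lambda > 0$ shrinks it towards zero. The function names are illustrative.

```python
def leaf_weight(g, e, lam):
    """beta* = -sum(g) / (sum(e) + lambda) over the cases in one leaf."""
    return -sum(g) / (sum(e) + lam)

def leaf_objective(beta, g, e, lam):
    """Per-leaf objective: (sum g) * beta + 1/2 * (sum e + lambda) * beta^2."""
    return sum(g) * beta + 0.5 * (sum(e) + lam) * beta ** 2

# Illustrative leaf with three cases under the squared loss:
# g(x_n) = F_n - y_n, so the residuals y_n - F_n are 1.5, 0.5 and 1.0.
g = [-1.5, -0.5, -1.0]
e = [1.0, 1.0, 1.0]

beta_plain = leaf_weight(g, e, lam=0.0)   # mean residual
beta_reg = leaf_weight(g, e, lam=1.0)     # shrunk by the regulariser
```

Since the per-leaf objective is a quadratic in $\beta_j$ with positive curvature, moving away from `beta_reg` in either direction increases it.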

The best objective value is

$$L^{(m)*} = -\frac{1}{2} \sum_{j=1}^{J} \frac{\left( \sum_{n \in I_j} g(x_n) \right)^2}{\sum_{n \in I_j} e(x_n) + \lambda} + \gamma J$$

This value can be used as a score to measure the quality of the current tree.

We cannot assess all the possible trees. Instead, a greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used.

How? Consider a leaf node $I$ (or a region $R$) and split it into two nodes $I_L$ and $I_R$; the score change will be

$$L_{\text{split}} = \frac{1}{2} \left[ \frac{\left( \sum_{n \in I_L} g(x_n) \right)^2}{\sum_{n \in I_L} e(x_n) + \lambda} + \frac{\left( \sum_{n \in I_R} g(x_n) \right)^2}{\sum_{n \in I_R} e(x_n) + \lambda} - \frac{\left( \sum_{n \in I} g(x_n) \right)^2}{\sum_{n \in I} e(x_n) + \lambda} \right] - \gamma$$

We split the node that produces the largest splitting score.
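The splitting score is the amount by which the best objective value drops when one leaf is replaced by two. The sketch below checks this on made-up gradients; the function names and the values of $\lambda$ and $\gamma$ are illustrative choices.

```python
def score(g, e, lam):
    """(sum g)^2 / (sum e + lambda) for the cases in one node."""
    return sum(g) ** 2 / (sum(e) + lam)

def split_gain(g_left, e_left, g_right, e_right, lam, gamma):
    """L_split for splitting node I into I_L and I_R."""
    parent = score(g_left + g_right, e_left + e_right, lam)
    children = score(g_left, e_left, lam) + score(g_right, e_right, lam)
    return 0.5 * (children - parent) - gamma

def best_objective(leaves, lam, gamma):
    """L* = -1/2 * sum_j score_j + gamma * J, for leaves given as (g, e) pairs."""
    return -0.5 * sum(score(g, e, lam) for g, e in leaves) + gamma * len(leaves)

# Illustrative node with four cases; the candidate split separates the
# cases with negative gradients from those with positive gradients.
g_left, e_left = [-2.0, -1.0], [1.0, 1.0]
g_right, e_right = [1.0, 2.0], [1.0, 1.0]
lam, gamma = 1.0, 0.5

gain = split_gain(g_left, e_left, g_right, e_right, lam, gamma)
before = best_objective([(g_left + g_right, e_left + e_right)], lam, gamma)
after = best_objective([(g_left, e_left), (g_right, e_right)], lam, gamma)
```

A positive `gain` means the split lowers the objective even after paying the penalty $\gamma$ for the extra leaf; a negative one means the split is not worth making.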

XGBoost

XGBoost stands for eXtreme Gradient Boosting, an efficient implementation of gradient tree boosting

XGBoost was proposed by Tianqi Chen for large-scale machine learning in 2014 (now has around 8000 citations).

XGBoost is open-sourced and now available in many languages. There are distributed implementations.

XGBoost is widely used in Kaggle competitions for structured or tabular data and used in many companies.

Recap

We discussed:

gradient boosting: how to handle more general loss functions

gradient boosting for regression and classification

XGBoost: extreme gradient boosting.

Next week will be the last formal lecture. We will discuss Neural Networks.

Thank you!
