INDIVIDUAL ASSIGNMENT
Lecture 7 Gradient Boosting
Recap
In the last lecture, we discussed:
bagging: ensembles of base models on bootstrapped versions of the data
random forest: bagging for trees with randomisation for selecting features for splitting
boosting: ensembles of models trained sequentially, later ones trying to correct/improve over previous ones. Adaboost reweights data points so that misclassified points get higher weights.
Example: Combining decision trees and logistic regression: Practical Lessons from Predicting Clicks on Ads at Facebook (He et al.), https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/.
Outline
1 Review of Bagging and Boosting
2 Gradient boosting for regression
3 Gradient boosting for multiclass classification
4 A bit of history and a comparison
5 Gradient tree boosting and XGBoost
A quick review of Bagging and Boosting
https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/
Set up and idea
Consider a regression task given the training set $\{(x_n, y_n)\}_{n=1}^N$.
A natural objective function for this task is the squared loss.
Suppose we already have a model $f(x)$ but we are not happy with it.
Question: How can we improve the current model $f(x)$?
Key idea: Look at the differences (aka the residuals) $y - f(x)$ and try to find a function $h(x)$ that fits them.
That is, we keep $f(x)$ fixed and find $h(x)$ such that $f(x) + h(x)$ approximately matches $y$.
Practical implementation
Given the data $\{(x_n, y_n)\}_{n=1}^N$ and the previous fit $f(x)$:
We first compute the residuals $\{(x_n, y_n - F_n)\}_{n=1}^N$, where $F_n = f(x_n)$.
We then fit another regression model $h(x)$ (linear, decision tree, ...) to the residuals.
If $f(x) + h(x)$ is still not good enough, treat it as the new $f(x)$ and repeat.
Relationship to gradient descent
Consider the loss function
$$L(y, f(x)) = \frac{1}{2} \sum_{n=1}^N (y_n - F_n)^2$$
If we treat the actual function values $\{F_n\}_{n=1}^N$ as parameters,
$$\frac{\partial L}{\partial F_n} = F_n - y_n$$
That is, the residuals are exactly the negative gradients!
Thus the new function value $F_n$ after fitting $h(x)$ is
$$F_n = f(x_n) \leftarrow F_n + h(x_n) \quad \text{(new func = prev func + new fit)}$$
$$\approx F_n + (y_n - F_n) \quad \text{(new fit approximates previous residuals)}$$
$$= F_n - \frac{\partial L}{\partial F_n} \quad \text{(residuals = negative gradients)}$$
i.e. we have done approximate gradient descent with step size = 1!
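The identity "residual = negative gradient" can be checked numerically. The sketch below is only an illustration: the targets and current predictions are made-up numbers, and `squared_loss` and `numerical_gradient` are hypothetical helper names, not part of the lecture.

```python
def squared_loss(y, F):
    """L = 1/2 * sum_n (y_n - F_n)^2, with the F_n treated as free parameters."""
    return 0.5 * sum((yn - Fn) ** 2 for yn, Fn in zip(y, F))


def numerical_gradient(y, F, n, eps=1e-6):
    """Central finite-difference estimate of dL/dF_n."""
    up = list(F)
    up[n] += eps
    down = list(F)
    down[n] -= eps
    return (squared_loss(y, up) - squared_loss(y, down)) / (2 * eps)


y = [3.0, -1.0, 2.5]   # made-up targets y_n
F = [2.0, 0.5, 2.5]    # made-up current predictions F_n = f(x_n)

residuals = [yn - Fn for yn, Fn in zip(y, F)]
neg_grads = [-numerical_gradient(y, F, n) for n in range(len(y))]

# A gradient-descent step with step size 1 on the F_n moves them onto the targets.
F_new = [Fn + r for Fn, r in zip(F, residuals)]
```

In practice the step is only approximate, because $h(x)$ is a fitted model and will not reproduce the residuals exactly.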
Gradient boosting for regression with the squared loss
Start with an initial model, for example the constant model $f(x) \equiv \frac{1}{N} \sum_{n=1}^N y_n$.
Do the following steps until some criteria are satisfied:
Calculate the negative gradients for $n = 1, 2, \dots, N$:
$$-g(x_n) := -\frac{\partial L}{\partial F_n} = y_n - F_n$$
Fit a model $h(x)$ to the negative gradients $\{-g(x_n)\}_{n=1}^N$ such that $h(x_n) \approx -g(x_n)$.
Note this is a regression problem.
Construct the new model as
$$f^{\text{new}}(x) := f(x) + \rho h(x)$$
with a learning rate $\rho$ (here $\rho = 1$). This rate can be optimised.
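The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes a single input feature, uses a depth-1 regression tree (a "stump") as the base model $h$, and the names `fit_stump`, `gradient_boost` and `mse`, the toy data and the choice $\rho = 0.5$ are all assumptions made for the example.

```python
def fit_stump(x, r):
    """Least-squares depth-1 regression tree on one feature.

    Tries every midpoint between consecutive distinct inputs and keeps the
    threshold with the smallest sum of squared errors."""
    xs = sorted(set(x))
    mean_r = sum(r) / len(r)
    best = None  # (sse, threshold, left_value, right_value)
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all inputs identical: fall back to a constant
        return lambda xi: mean_r
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr


def gradient_boost(x, y, n_rounds=50, rho=0.5):
    """Gradient boosting for the squared loss with stumps as base models."""
    f0 = sum(y) / len(y)              # initial constant model: the mean of y
    F = [f0] * len(y)                 # current function values F_n = f(x_n)
    stumps = []
    for _ in range(n_rounds):
        neg_grad = [yn - Fn for yn, Fn in zip(y, F)]   # residuals
        h = fit_stump(x, neg_grad)                     # regression on the residuals
        stumps.append(h)
        F = [Fn + rho * h(xn) for Fn, xn in zip(F, x)]
    return lambda xi: f0 + rho * sum(h(xi) for h in stumps)


def mse(model, x, y):
    return sum((yn - model(xn)) ** 2 for xn, yn in zip(x, y)) / len(y)


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 4.2, 8.8, 16.1, 24.9]   # toy data, roughly y = x^2

model = gradient_boost(x, y)
constant = lambda xi: sum(y) / len(y)  # baseline: the initial model alone
```

Comparing `mse(model, x, y)` with `mse(constant, x, y)` shows how much the boosted stumps improve on the initial constant fit on the training data.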
Generalisation to arbitrary differentiable loss function
Note that the only item specific to regression with the squared loss is the loss and its gradient, i.e. the step that computes $-g(x_n) = y_n - F_n$. We can replace this step with the gradient of any differentiable loss function.
Start with an initial model $f(x)$.
Do the following steps until some criteria are satisfied:
Calculate the negative gradients for $n = 1, 2, \dots, N$:
$$-g(x_n) := -\frac{\partial L}{\partial F_n}$$
Fit a model $h(x)$ to the negative gradients $\{-g(x_n)\}_{n=1}^N$ such that $h(x_n) \approx -g(x_n)$.
Note this is a regression problem.
Construct the new model as
$$f^{\text{new}}(x) := f(x) + \rho h(x)$$
with a learning rate $\rho = 1$.
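To see that only the gradient line changes, here is a deliberately tiny sketch. It assumes the simplest possible base model, a constant (whose least-squares fit to the negative gradients is just their mean), so $f$ is a single number. The function name `boost_constant`, the data and the settings are hypothetical.

```python
def sign(v):
    return (v > 0) - (v < 0)


def boost_constant(y, neg_gradient, n_rounds=200, rho=0.1):
    """Generic gradient boosting where every base model h is a constant."""
    F = 0.0                                           # initial model f(x) = 0
    for _ in range(n_rounds):
        pseudo = [neg_gradient(yn, F) for yn in y]    # negative gradients -g(x_n)
        h = sum(pseudo) / len(pseudo)                 # least-squares constant fit
        F += rho * h                                  # f_new = f + rho * h
    return F


y = [1.0, 2.0, 3.0, 4.0, 100.0]   # made-up targets with one outlier

# Squared loss 1/2 (y - F)^2: the negative gradient is the residual y - F.
F_mean = boost_constant(y, lambda yn, F: yn - F)

# Absolute loss |y - F|: the negative gradient is sign(y - F).
F_median = boost_constant(y, lambda yn, F: sign(yn - F))
```

With the squared loss the constant model drifts to the mean of the targets (22 here), while with the absolute loss it settles near the median (3), which is much less affected by the outlier.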
Gradient boosting: generalisation to arbitrary differentiable loss function
Recall the generic algorithm: start with an initial model $f(x)$, then repeatedly compute the negative gradients $-g(x_n) := -\partial L / \partial F_n$, fit a regression model $h(x)$ to them, and update $f^{\text{new}}(x) := f(x) + \rho h(x)$.
To apply it to multiclass classification, we need to define $f(x)$ and $g(x_n)$.
Multiclass classification [Lecture 3] Set up
Multiclass classification: multiple potential outcomes (1/2/3/.../K, or positive/neutral/negative).
We construct a classifier that learns the class probabilities:
$p(y = k \mid x)$: probability of class $k$ given input $x$
For multiclass classification, $k = 1$ or $k = 2$ or ... or $k = K$, and by the laws of probability:
$$p(y = 1 \mid x) + p(y = 2 \mid x) + \dots + p(y = K \mid x) = 1$$
Write $p(y = 1 \mid x) = P_1(x)$, $p(y = 2 \mid x) = P_2(x)$, ..., $p(y = K \mid x) = P_K(x)$. We need:
$$0 \le P_k(x) \le 1 \;\; \forall x, k, \qquad P_1(x) + P_2(x) + \dots + P_K(x) = 1$$
Multiclass classification [Lecture 3] Softmax
We define $K$ functions, $\{f_k(x; \beta)\}_{k=1}^K$.
Reminder: we want $\{P_k(x; \beta)\}_{k=1}^K$ such that $0 \le P_k(x; \beta) \le 1$, $\forall x, k$, and $\sum_k P_k(x; \beta) = 1$.
The idea is to "squash" $\{f_k(x; \beta)\}_{k=1}^K$ onto the $(K-1)$-simplex:
$$P_k(x; \beta) = \mathrm{softmax}(f_k(x; \beta)) = \frac{\exp(f_k(x; \beta))}{\sum_{i=1}^K \exp(f_i(x; \beta))}$$
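A small sketch of the softmax map, assuming the $K$ function values for one input are given as a plain list of scores. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result; the example scores are made up.

```python
import math


def softmax(scores):
    """Map K real-valued scores f_1, ..., f_K to probabilities P_1, ..., P_K."""
    m = max(scores)                        # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])           # made-up scores for K = 3 classes
large = softmax([1000.0, 1000.0])          # would overflow without the shift
```

The outputs are non-negative and sum to one, so they satisfy both requirements on $P_k$, and the largest score always receives the largest probability.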
Multiclass classification [Lecture 3] Objective function
Note we can write the likelihood:
$$p(y = k \mid x, \beta) = P_k(x; \beta)$$
Similar to the binary classification case, we want to minimise the negative log-likelihood:
$$L(\beta) = -\sum_{n=1}^N \log P_{y_n}(x_n; \beta)$$
which is often referred to as the multi-class cross-entropy loss.
If one-hot coding is used, $t_n = (t_{n1}, t_{n2}, \dots, t_{nK})$ where $t_{nk} = 1$ if $y_n = k$ and 0 otherwise, the loss function above can be rewritten as
$$L(\beta) = -\sum_{n=1}^N \sum_{k=1}^K t_{nk} \log P_k(x_n; \beta)$$
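The two ways of writing the loss give the same number, as the sketch below illustrates. The probability table `P` is made up, class labels are 0-based here (unlike the 1-based $k$ on the slides), and both function names are hypothetical.

```python
import math


def cross_entropy_labels(P, y):
    """-sum_n log P_{y_n}(x_n), with integer class labels y_n (0-based)."""
    return -sum(math.log(P[n][y[n]]) for n in range(len(y)))


def cross_entropy_one_hot(P, T):
    """-sum_n sum_k t_nk log P_k(x_n), with one-hot targets t_n."""
    return -sum(
        T[n][k] * math.log(P[n][k])
        for n in range(len(T))
        for k in range(len(T[n]))
    )


# Made-up predicted probabilities for N = 3 points and K = 3 classes.
P = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4]]
y = [0, 1, 2]                                               # true classes
T = [[1 if k == yn else 0 for k in range(3)] for yn in y]   # one-hot coding

loss_labels = cross_entropy_labels(P, y)
loss_one_hot = cross_entropy_one_hot(P, T)
```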
Back to gradient boosting. Multiclass: functions, loss and gradients
Instead of one function as in regression, we will have $K$ functions, $f_k(x)$.
We have the loss function
$$L = -\sum_{n=1}^N \sum_{k=1}^K t_{nk} \log P_k(x_n)$$
where $P_k(x_n)$ is the softmax of the function values $\{f_k(x_n)\}_{k=1}^K$.
So we can derive the negative gradients with respect to the function values $\{f_k(x_n)\}$:
$$-\frac{\partial L}{\partial f_k(x_n)} = t_{nk} - P_k(x_n)$$
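The gradient formula can be sanity-checked with finite differences for a single data point. This is only a numerical illustration; the function values `f` and the target `t` are made-up numbers.

```python
import math


def softmax(f):
    m = max(f)
    exps = [math.exp(v - m) for v in f]
    total = sum(exps)
    return [v / total for v in exps]


def loss(f, t):
    """Cross-entropy for one data point with one-hot target t."""
    return -sum(tk * math.log(pk) for tk, pk in zip(t, softmax(f)))


f = [0.3, -1.2, 0.8]   # made-up function values f_k(x_n), K = 3
t = [0, 1, 0]          # one-hot target: the second class is the true one

# Analytic negative gradient from the slide: t_nk - P_k(x_n).
analytic = [tk - pk for tk, pk in zip(t, softmax(f))]

# Central finite-difference estimate of -dL/df_k.
eps = 1e-6
numeric = []
for k in range(len(f)):
    up = list(f)
    up[k] += eps
    down = list(f)
    down[k] -= eps
    numeric.append(-(loss(up, t) - loss(down, t)) / (2 * eps))
```

Note that the $K$ negative gradients of one data point always sum to zero, because both $t_n$ and the probabilities sum to one.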
Gradient boosting for multiclass classification: a summary
Start with $K$ initial functions, $\{f_k(x) \equiv 0\}_{k=1}^K$.
Do the following steps until some criteria are satisfied:
Calculate the negative gradients for $n = 1, 2, \dots, N$, $k = 1, 2, \dots, K$:
$$-g_k(x_n) := -\frac{\partial L}{\partial F_{nk}} = t_{nk} - P_k(x_n), \qquad F_{nk} = f_k(x_n)$$
Fit $K$ models $\{h_k(x)\}_{k=1}^K$ to the negative gradients $\{-g_k(x_n)\}_{n,k}$ such that $h_k(x_n) \approx -g_k(x_n)$.
Note this is a regression problem.
Construct the new models as
$$f_k^{\text{new}}(x) := f_k(x) + \rho h_k(x)$$
with learning rate $\rho$.
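The summary can be turned into a short runnable sketch. Everything specific here is an assumption made for illustration: a single input feature, depth-1 regression trees as base models, 0-based class labels, the toy data, $\rho = 0.5$, and the names `fit_stump`, `boost_multiclass` and `cross_entropy`.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_stump(x, r):
    """Least-squares depth-1 regression tree on one feature."""
    xs = sorted(set(x))
    mean_r = sum(r) / len(r)
    best = None  # (sse, threshold, left_value, right_value)
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:
        return lambda xi: mean_r
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr


def boost_multiclass(x, y, K, n_rounds=20, rho=0.5):
    """Gradient boosting with K functions f_k, all starting at 0."""
    N = len(x)
    F = [[0.0] * K for _ in range(N)]        # F[n][k] = f_k(x_n)
    ensembles = [[] for _ in range(K)]       # the stumps added to each f_k
    for _ in range(n_rounds):
        P = [softmax(Fn) for Fn in F]        # probabilities from the current F
        for k in range(K):
            # Negative gradient t_nk - P_k(x_n) for class k.
            neg_grad = [(1.0 if y[n] == k else 0.0) - P[n][k] for n in range(N)]
            h = fit_stump(x, neg_grad)
            ensembles[k].append(h)
            for n in range(N):
                F[n][k] += rho * h(x[n])
    return lambda xi: softmax([rho * sum(h(xi) for h in ens) for ens in ensembles])


def cross_entropy(predict_proba, x, y):
    return -sum(math.log(predict_proba(xn)[yn]) for xn, yn in zip(x, y))


x = [0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 2.0, 2.1, 2.2]   # toy 1-D inputs
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]                     # 0-based class labels

predict_proba = boost_multiclass(x, y, K=3)
initial_loss = len(x) * math.log(3)   # loss of the initial model, P_k = 1/3
final_loss = cross_entropy(predict_proba, x, y)
```

All $K$ sets of negative gradients in a round are computed from the same probabilities, which matches the order of the steps in the summary.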
Classification Example (K = 3 classes)
Step 1: Initial models
Step 2: Calculate the negative gradients
Step 3a: Modeling from X to −g1 [Model 1]
Build a decision tree $h_1(x)$ of depth 1 from all X to $-g_1$:
$$h_1(x) = \begin{cases} 0.667 & \text{if } x_4 \le 0.65 \\ -0.333 & \text{if } x_4 > 0.65 \end{cases}$$
Step 3b: Modeling from X to −g2 [Model 2]
Build a decision tree $h_2(x)$ of depth 1 from all X to $-g_2$:
$$h_2(x) = \begin{cases} 0.667 & \text{if } x_2 \le 2.95 \\ -0.222 & \text{if } x_2 > 2.95 \end{cases}$$
Step 3c: Modeling from X to −g3 [Model 3]
Build a decision tree $h_3(x)$ of depth 1 from all X to $-g_3$:
$$h_3(x) = \begin{cases} -0.333 & \text{if } x_4 \le 1.70 \\ 0.667 & \text{if } x_4 > 1.70 \end{cases}$$
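Step 4: Update the three models $f_k(x) := f_k(x) + \rho h_k(x)$, setting $\rho = 1$ for simplicity
$$f_1(x) := f_1(x) + h_1(x) = 0 + h_1(x) = \begin{cases} 0.667 & \text{if } x_4 \le 0.65 \\ -0.333 & \text{if } x_4 > 0.65 \end{cases}$$
$$f_2(x) := f_2(x) + h_2(x) = 0 + h_2(x) = \begin{cases} 0.667 & \text{if } x_2 \le 2.95 \\ -0.222 & \text{if } x_2 > 2.95 \end{cases}$$
$$f_3(x) := f_3(x) + h_3(x) = 0 + h_3(x) = \begin{cases} -0.333 & \text{if } x_4 \le 1.70 \\ 0.667 & \text{if } x_4 > 1.70 \end{cases}$$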
Step 5: Update the probabilities $P_k$ from the new function values $F_k$.
Step 6: Find negative gradients again.
Step 7: Modeling the negative gradients again, we obtain the second set of base models $h_1(x)$, $h_2(x)$ and $h_3(x)$:
$$h_1(x) = \begin{cases} 0.437 & \text{if } x_4 \le 0.65 \\ -0.225 & \text{if } x_4 > 0.65 \end{cases}$$
$$h_2(x) = \begin{cases} 0.278 & \text{if } x_2 \le 2.95 \\ -0.076 & \text{if } x_2 > 2.95 \end{cases}$$
$$h_3(x) = \begin{cases} -0.155 & \text{if } x_4 \le 1.70 \\ 0.296 & \text{if } x_4 > 1.70 \end{cases}$$
Note: the splitting points for this set of base models are the same as for the first set.
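Step 8: Final models (suppose we stop here):
$$f_1(x) := f_1(x) + h_1(x) = \begin{cases} 1.104 & \text{if } x_4 \le 0.65 \\ -0.558 & \text{if } x_4 > 0.65 \end{cases}$$
$$f_2(x) := f_2(x) + h_2(x) = \begin{cases} 0.945 & \text{if } x_2 \le 2.95 \\ -0.298 & \text{if } x_2 > 2.95 \end{cases}$$
$$f_3(x) := f_3(x) + h_3(x) = \begin{cases} -0.488 & \text{if } x_4 \le 1.70 \\ 0.963 & \text{if } x_4 > 1.70 \end{cases}$$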
Step 9: Prediction on $x^* = [4.7, 3.2, 1.3, 0.2]^T$. For this point we have
$$F_1(x^*) = 1.104; \quad F_2(x^*) = -0.298; \quad F_3(x^*) = -0.488$$
Hence the probabilities are
$$P_1(x^*) = \frac{e^{F_1(x^*)}}{e^{F_1(x^*)} + e^{F_2(x^*)} + e^{F_3(x^*)}} = 0.690$$
$$P_2(x^*) = \frac{e^{F_2(x^*)}}{e^{F_1(x^*)} + e^{F_2(x^*)} + e^{F_3(x^*)}} = 0.170$$
$$P_3(x^*) = \frac{e^{F_3(x^*)}}{e^{F_1(x^*)} + e^{F_2(x^*)} + e^{F_3(x^*)}} = 0.140$$
This case is therefore classified as Class 1.
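The arithmetic of Steps 4 to 9 can be reproduced directly from the stump values quoted in the example. In the sketch below the helper `stump` and the dictionary layout are assumptions for illustration; feature indices are 0-based, so $x_4$ is `x[3]` and $x_2$ is `x[1]`.

```python
import math


def stump(feature, threshold, left, right):
    """Depth-1 tree: returns `left` if x[feature] <= threshold, else `right`."""
    return lambda x: left if x[feature] <= threshold else right


# For each class, the stumps from the first round (Step 3) and second round (Step 7).
rounds = {
    1: [stump(3, 0.65, 0.667, -0.333), stump(3, 0.65, 0.437, -0.225)],
    2: [stump(1, 2.95, 0.667, -0.222), stump(1, 2.95, 0.278, -0.076)],
    3: [stump(3, 1.70, -0.333, 0.667), stump(3, 1.70, -0.155, 0.296)],
}

x_star = [4.7, 3.2, 1.3, 0.2]

# Final function values: initial 0 plus the sum of the stumps (rho = 1).
F = {k: sum(h(x_star) for h in hs) for k, hs in rounds.items()}

# Softmax over the three function values.
Z = sum(math.exp(v) for v in F.values())
P = {k: math.exp(v) / Z for k, v in F.items()}

predicted = max(P, key=P.get)
```

Rounded to three decimals, `F` and `P` agree with the values in Steps 8 and 9, and `predicted` is class 1.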
History of boosting and gradient boosting
AdaBoost was invented by Freund et al. (1996) and Freund and Schapire (1997).
Breiman et al. (1998) and Breiman (1999) interpreted AdaBoost as a gradient descent algorithm under a special loss function.
Friedman et al. (2000) and Friedman (2001) proposed gradient boosting for any differentiable loss function.
AdaBoost trains models sequentially, reweights data points, and can be interpreted as gradient descent with the exponential loss function.
Gradient boosting trains models sequentially, considers the negative gradients/residuals, and can be used with any differentiable loss function.
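Some maths: Gradient Boosting*
Let $F^{(m-1)}(x_n)$ be the prediction for the $n$-th case at the $(m-1)$-th iteration. At the $m$-th iteration, we need to add $h_m$ to minimise the following objective:
$$L^{(m)} = \sum_{n=1}^N L\big(t_n, F^{(m-1)}(x_n) + h_m(x_n)\big) + \Omega(h_m)$$
Using a second-order Taylor approximation, one has
$$L^{(m)} \approx \sum_{n=1}^N \left[ L\big(t_n, F^{(m-1)}(x_n)\big) + g(x_n) h_m(x_n) + \frac{1}{2} e(x_n) h_m^2(x_n) \right] + \Omega(h_m)$$
where
$$g(x_n) = \frac{\partial L\big(t_n, F^{(m-1)}(x_n)\big)}{\partial F^{(m-1)}(x_n)}, \qquad e(x_n) = \frac{\partial^2 L\big(t_n, F^{(m-1)}(x_n)\big)}{\partial F^{(m-1)}(x_n)\, \partial F^{(m-1)}(x_n)}$$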
Some more maths: Gradient Boosting*
We write the objective as
$$L^{(m)} = \sum_{n=1}^N \frac{1}{2} e(x_n) \left[ h_m(x_n) - \left( -\frac{g(x_n)}{e(x_n)} \right) \right]^2 + \Omega(h_m) + \text{const.}$$
Minimising $L^{(m)}$ is equivalent to fitting $h_m$ to $-\frac{g(x_n)}{e(x_n)}$.
For example, for regression with the squared error loss function, we have $e(x_n) = 1$. That is why we fit $h_m$ to the negative gradient $-g(x_n)$. For the cross-entropy loss function, we no longer have $e(x_n) = 1$, but we could still try to fit $h_m$ to the negative gradient $-g(x_n)$.
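The step from the Taylor expansion to the squared form is completing the square in $h_m(x_n)$. A short worked version follows, writing $g$, $e$ and $h$ for $g(x_n)$, $e(x_n)$ and $h_m(x_n)$ and assuming $e > 0$. The last line states the second derivative for the softmax cross-entropy loss with respect to $f_k(x_n)$, which is why $e(x_n) \neq 1$ in that case.

```latex
\begin{aligned}
g\,h + \tfrac{1}{2}\, e\, h^2
  &= \tfrac{1}{2}\, e \left( h^2 + 2\,\frac{g}{e}\, h \right) \\
  &= \tfrac{1}{2}\, e \left[ h - \left( -\frac{g}{e} \right) \right]^2 - \frac{g^2}{2e}
  \qquad \text{(the last term does not depend on } h\text{)} \\[6pt]
\text{squared loss:}\quad & g = F_n - y_n, \qquad e = 1 \\
\text{cross-entropy:}\quad & g_k(x_n) = P_k(x_n) - t_{nk}, \qquad
  e_k(x_n) = P_k(x_n)\,\bigl(1 - P_k(x_n)\bigr)
\end{aligned}
```

The terms $L(t_n, F^{(m-1)}(x_n))$ and $-g^2/(2e)$ are what the slide collects into "const.", and the factor $\frac{1}{2} e(x_n)$ acts as a per-sample weight in the fit.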
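Gradient Tree Boosting
Using a tree model $h_m$ in the $m$-th iteration, Gradient Tree Boosting takes two steps:
Step 1: Fit the tree $h_m$ to $-\frac{g(x_n)}{e(x_n)}$. Keep the tree structure, i.e., the partition of the input space into $J$ regions $R_1, R_2, \dots, R_J$.
Step 2: The tree can be expressed as
$$h_m(x) = \sum_{j=1}^J \beta_j \mathbb{1}(x, R_j), \qquad \text{where } \mathbb{1}(x, R_j) = \begin{cases} 1 & \text{if } x \in R_j \\ 0 & \text{otherwise} \end{cases}$$
Optimize, with respect to all $\beta_j$, the loss
$$L^{(m)} \approx \sum_{n=1}^N \left[ g(x_n) h_m(x_n) + \frac{1}{2} e(x_n) h_m^2(x_n) \right] + \frac{1}{2} \lambda \sum_{j=1}^J \beta_j^2 + \gamma J$$
Here the constant term of the Taylor expansion has been dropped and the regulariser is $\Omega(h_m) = \frac{1}{2} \lambda \sum_{j=1}^J \beta_j^2 + \gamma J$.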
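Consider the model we are working on
$$h_m(x) = \sum_{j=1}^J \beta_j \mathbb{1}(x, R_j)$$
If the case $x_n$ falls in the region $R_1$, then $h_m(x_n) = \beta_1$ and $h_m^2(x_n) = \beta_1^2$.
Let us denote
$$I_j = \{ n : x_n \in R_j \}$$
Then we have
$$L^{(m)} \approx \sum_{j=1}^J \left[ \left( \sum_{n \in I_j} g(x_n) \right) \beta_j + \frac{1}{2} \left( \sum_{n \in I_j} e(x_n) + \lambda \right) \beta_j^2 \right] + \gamma J$$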
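The new $L^{(m)}$ can be optimized with respect to each $\beta_j$ individually:
$$\min_{\beta_j} \; \left( \sum_{n \in I_j} g(x_n) \right) \beta_j + \frac{1}{2} \left( \sum_{n \in I_j} e(x_n) + \lambda \right) \beta_j^2$$
The best solution for $\beta_j$ is given by
$$\beta_j^* = -\frac{\sum_{n \in I_j} g(x_n)}{\sum_{n \in I_j} e(x_n) + \lambda}$$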
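A small sketch of the leaf-weight formula, assuming the gradients `g` and second derivatives `e` of the samples falling into one leaf are given as lists. The numbers correspond to a made-up squared-loss case (residuals 1, 2, 3, so $g = F_n - y_n$ is negative and $e = 1$), and the function names are hypothetical.

```python
def leaf_weight(g, e, lam):
    """beta_j* = -sum(g) / (sum(e) + lambda) for the samples in one leaf."""
    return -sum(g) / (sum(e) + lam)


def leaf_objective(beta, g, e, lam):
    """Per-leaf objective: (sum g) * beta + 1/2 * (sum e + lambda) * beta^2."""
    return sum(g) * beta + 0.5 * (sum(e) + lam) * beta ** 2


g = [-1.0, -2.0, -3.0]   # squared loss: g = F_n - y_n, i.e. residuals 1, 2, 3
e = [1.0, 1.0, 1.0]      # squared loss: e = 1 for every sample

beta_reg = leaf_weight(g, e, lam=1.0)     # shrunk towards zero by lambda
beta_plain = leaf_weight(g, e, lam=0.0)   # equals the mean residual

value_at_optimum = leaf_objective(beta_reg, g, e, lam=1.0)
```

With $\lambda = 0$ and the squared loss, the optimal leaf value is just the mean residual in the leaf, which is what an ordinary regression tree would predict; a positive $\lambda$ shrinks it towards zero.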
The best objective value is
$$L^{(m)*} = -\frac{1}{2} \sum_{j=1}^J \frac{\left( \sum_{n \in I_j} g(x_n) \right)^2}{\sum_{n \in I_j} e(x_n) + \lambda} + \gamma J$$
This value can be used as a score to measure the quality of the current tree.
We cannot assess all the possible trees. Instead, a greedy algorithm is used that starts from a single leaf and iteratively adds branches to the tree.
How? Consider a leaf node $I$ (or a region $R$) and split it into two nodes $I_L$ and $I_R$. The score change will be
$$L_{\text{split}} = \frac{1}{2} \left[ \frac{\left( \sum_{n \in I_L} g(x_n) \right)^2}{\sum_{n \in I_L} e(x_n) + \lambda} + \frac{\left( \sum_{n \in I_R} g(x_n) \right)^2}{\sum_{n \in I_R} e(x_n) + \lambda} - \frac{\left( \sum_{n \in I} g(x_n) \right)^2}{\sum_{n \in I} e(x_n) + \lambda} \right] - \gamma$$
We split the node that produces the largest splitting score.
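The splitting score can be evaluated for every candidate threshold of a feature. The sketch below is an illustration under stated assumptions, not how any particular library implements it: one feature, made-up gradients for a squared-loss problem, and hypothetical names `score`, `split_gain` and `best_split`.

```python
def score(g, e, lam):
    """(sum g)^2 / (sum e + lambda) for one node."""
    return sum(g) ** 2 / (sum(e) + lam)


def split_gain(gL, eL, gR, eR, lam, gamma):
    """L_split: half the score improvement of the two children over the
    parent, minus the penalty gamma for adding one more leaf."""
    parent = score(gL + gR, eL + eR, lam)
    return 0.5 * (score(gL, eL, lam) + score(gR, eR, lam) - parent) - gamma


def best_split(x, g, e, lam, gamma):
    """Scan midpoints between consecutive distinct values of one feature."""
    order = sorted(range(len(x)), key=lambda n: x[n])
    best_gain, best_threshold = float("-inf"), None
    for i in range(1, len(order)):
        lo, hi = x[order[i - 1]], x[order[i]]
        if lo == hi:
            continue                      # no threshold separates equal values
        left, right = order[:i], order[i:]
        gain = split_gain(
            [g[n] for n in left], [e[n] for n in left],
            [g[n] for n in right], [e[n] for n in right],
            lam, gamma,
        )
        if gain > best_gain:
            best_gain, best_threshold = gain, (lo + hi) / 2
    return best_threshold, best_gain


x = [1.0, 2.0, 3.0, 4.0]
g = [-2.0, -2.0, 2.0, 2.0]   # made-up gradients: two groups with opposite sign
e = [1.0, 1.0, 1.0, 1.0]     # squared loss: e = 1

threshold, gain = best_split(x, g, e, lam=1.0, gamma=0.5)
```

A split is only worthwhile when its gain is positive, so a larger $\gamma$ leads to smaller trees.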
XGBoost
XGBoost stands for eXtreme Gradient Boosting, an efficient implementation of gradient tree boosting
XGBoost was proposed by Tianqi Chen in 2014 for large-scale machine learning (now has around 8000 citations).
XGBoost is open source and now available in many languages. There are distributed implementations.
XGBoost is widely used in Kaggle competitions for structured or tabular data and used in many companies.
Recap
We discussed:
gradient boosting: how to handle more general loss functions
gradient boosting for regression and classification
XGBoost: eXtreme Gradient Boosting.
Next week will be the last formal lecture. We will discuss Neural Networks.
Thank you!