INDIVIDUAL ASSIGNMENT
Lecture 5 Decision trees
Recap
In the last lecture, we discussed:
Model selection: hold-out, cross-validation, hyperparameter tuning
Decision theory for classification
Clustering: k-means, mixture of Gaussians
Examples: Covid-19 forecasting [Google Cloud and Harvard Global Health Institute]: https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-is-releasing-the-covid-19-public-forecasts and Covid-19 mobility report [Google]: https://www.google.com/covid19/mobility/
Outline
1 Intro
2 Prediction
3 Learning or growing a decision tree
4 Practical issues
5 Pros and cons
An example
Imagine that we want to partition the input space into axis-aligned, box-like areas [left], where each box is associated with an outcome. We can equivalently represent this using a binary tree [right].
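[Figure: left, the (x1, x2) input space partitioned by the thresholds θ1 to θ4 into five regions A, B, C, D, E; right, the equivalent binary tree with decision nodes x1 > θ1, x2 > θ3, x1 ≤ θ4, x2 ≤ θ2 and leaves A to E.]
Source: Bishop, PRML book, Chapter 14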
Introduction
Many decision-making systems use a set of if-then-else rules:
mimicking how many of us often make decisions
easy to interpret and explain
Decision trees aim to learn a set of if-then-else rules from data:
learn the structure of the tree, i.e. how to partition the input space
specify a desirable output for each partition/region
Decision trees are very popular in practice. Two algorithms, C4.5 (successor of ID3) and CART, are in the top 10 in this 2007 survey: https://link.springer.com/article/10.1007/s10115-007-0114-2
An example: data
An example: decision tree
Terminologies and simplifying assumptions
We will consider binary trees. They are formed by nodes:
each node (except leaf nodes) is associated with a rule, “if $x_d > a$” or “if $x_d < b$”, where $x_d$ is the d-th feature of the input
the root node is where we start traversing the tree
a leaf node is associated with an output, used for prediction
Each decision node tests a single feature/covariate and makes a binary decision.
Lines that connect nodes are called branches.
The outputs at the leaf nodes can be categorical (for classification) or continuous (for regression).
Predicting using a decision tree
Given a tree and a new input, we traverse down the tree and use the output at the resulting leaf node as our prediction
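The traversal described above can be sketched in a few lines. This is a minimal illustration only: the `Node` class, its field names and the toy tree are assumptions made for this example and are not part of the lecture.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class Node:
    """A node of a binary decision tree (hypothetical structure for illustration)."""
    feature: Optional[int] = None       # index d of the feature tested at a decision node
    threshold: Optional[float] = None   # threshold a in the rule "x_d < a"
    left: Optional["Node"] = None       # branch taken when x_d < a
    right: Optional["Node"] = None      # branch taken when x_d >= a
    output: Union[float, str, None] = None  # prediction stored at a leaf node

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(node: Node, x: Sequence[float]):
    """Walk from the root to a leaf and return the leaf's output."""
    while not node.is_leaf():
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.output


# A toy tree: first test x_0 < 2.5, then (on the right branch) x_1 < 1.0.
tree = Node(
    feature=0, threshold=2.5,
    left=Node(output="blue"),
    right=Node(feature=1, threshold=1.0,
               left=Node(output="red"),
               right=Node(output="blue")),
)

print(predict(tree, [1.0, 0.0]))  # blue
print(predict(tree, [3.0, 0.5]))  # red
print(predict(tree, [3.0, 2.0]))  # blue
```

Prediction costs one comparison per level, so it is proportional to the depth of the tree rather than to the number of training points.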
Learning a regression tree: finding ŷ and tree structure
We will start with regression. Given a tree and the input regions corresponding to its leaf nodes, the prediction for region $R_l$ is
$\hat{y}_l = \mathrm{Average}\{y_i : i \in R_l\}$,
i.e. prediction = average of all outputs in the region.
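One way to see why the average is a natural choice: assuming the loss within a region is the sum of squared errors (the loss used for regression splits in this lecture), the best constant prediction $c$ for region $R_l$ is the mean. The symbol $n_l$, the number of training points in $R_l$, is introduced here for the derivation.

```latex
\begin{align*}
\frac{\partial}{\partial c} \sum_{i \in R_l} (y_i - c)^2
  &= -2 \sum_{i \in R_l} (y_i - c) = 0 \\
\Rightarrow \quad c
  &= \frac{1}{n_l} \sum_{i \in R_l} y_i = \hat{y}_l
\end{align*}
```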
We will next discuss how to find the regions $\{R_l\}_{l=1}^{L}$: the predictions at the leaf nodes must be close to the training outputs, i.e. we still wish to minimise some loss function.
searching over all possible binary trees is computationally intractable in general
we opt for a greedy procedure to grow the tree: consider one split at a time
Learning a regression tree: greedy procedure
Consider one split at a time: the objective is to obtain a model that explains the data as well as possible after a single split. There is no guarantee that this split is globally optimal when taking into account all other splits.
Consider the root node. We want to select a feature j from the p features and a threshold s which divide the input space into two half-spaces:
$R_1(j,s) = \{x : x_j < s\}$ and $R_2(j,s) = \{x : x_j \geq s\}$
Predictions associated with these regions:
$\hat{y}_1(j,s) = \mathrm{Average}\{y_i : i \in R_1(j,s)\}$ and $\hat{y}_2(j,s) = \mathrm{Average}\{y_i : i \in R_2(j,s)\}$
Similar to other models, we can write down the loss function:
$\sum_{i \in R_1(j,s)} (y_i - \hat{y}_1(j,s))^2 + \sum_{i \in R_2(j,s)} (y_i - \hat{y}_2(j,s))^2$
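As a rough sketch, the loss of a single candidate split can be computed as follows. The function name `split_loss` and the toy data are illustrative assumptions. The first sum runs over the points with $x_j < s$ and the second over the points with $x_j \geq s$.

```python
from statistics import mean


def split_loss(xs, ys, j, s):
    """Sum of squared errors after splitting on feature j at threshold s."""
    left = [y for x, y in zip(xs, ys) if x[j] < s]     # outputs in R1(j, s)
    right = [y for x, y in zip(xs, ys) if x[j] >= s]   # outputs in R2(j, s)
    loss = 0.0
    for region in (left, right):
        if region:  # an empty region contributes nothing
            y_hat = mean(region)  # prediction = average output in the region
            loss += sum((y - y_hat) ** 2 for y in region)
    return loss


# Toy data with one feature: the outputs jump between x = 2 and x = 3.
xs = [[1.0], [2.0], [3.0], [4.0]]
ys = [1.0, 1.0, 3.0, 3.0]

print(split_loss(xs, ys, 0, 2.5))  # 0.0: the split separates the two levels exactly
print(split_loss(xs, ys, 0, 1.5))  # larger: the right region mixes both levels
```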
Learning a regression tree: greedy procedure (continued)
Remember that we need to find the feature j and the corresponding threshold s that minimise the one-step objective function.
We can loop through all features, i.e. set j = 1, ..., p, and the available thresholds for each feature, and pick the pair (j,s) that minimises the loss.
After this step, the data is divided into two regions $R_1$ and $R_2$. We then apply the same splitting procedure to the data in each region, so each gets divided in two. We then apply the same splitting procedure to each sub-region, and so on.
So, when do we stop?
when each prediction exactly matches the training points? This would probably overfit.
when a criterion is met, e.g. maximum depth of the tree or minimum number of training points in each leaf node
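The whole greedy procedure, including the stopping criteria, might look roughly like the sketch below. Several choices here are assumptions rather than part of the lecture: candidate thresholds are taken as midpoints between consecutive distinct feature values, the tree is stored as nested dictionaries with plain floats as leaves, and the parameter names `max_depth` and `min_leaf` are hypothetical.

```python
from statistics import mean


def sse(ys):
    """Sum of squared errors around the mean of ys."""
    y_hat = mean(ys)
    return sum((y - y_hat) ** 2 for y in ys)


def best_split(xs, ys, min_leaf):
    """Return (loss, j, s) for the best split, or None if no split is allowed."""
    best = None
    for j in range(len(xs[0])):                      # loop over features
        values = sorted({x[j] for x in xs})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [y for x, y in zip(xs, ys) if x[j] < s]
            right = [y for x, y in zip(xs, ys) if x[j] >= s]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue                             # would create a too-small leaf
            loss = sse(left) + sse(right)
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best


def grow(xs, ys, depth=0, max_depth=2, min_leaf=1):
    """Grow a regression tree greedily; leaves are floats, decision nodes are dicts."""
    split = best_split(xs, ys, min_leaf) if depth < max_depth else None
    if split is None:
        return mean(ys)                              # leaf: average output
    _, j, s = split
    left = [(x, y) for x, y in zip(xs, ys) if x[j] < s]
    right = [(x, y) for x, y in zip(xs, ys) if x[j] >= s]
    return {
        "feature": j,
        "threshold": s,
        "left": grow(*zip(*left), depth + 1, max_depth, min_leaf),
        "right": grow(*zip(*right), depth + 1, max_depth, min_leaf),
    }


xs = [[1.0], [2.0], [3.0], [4.0]]
ys = [1.0, 1.0, 3.0, 3.0]
print(grow(xs, ys, max_depth=1))
# {'feature': 0, 'threshold': 2.5, 'left': 1.0, 'right': 3.0}
```

Each call only looks one split ahead, which is what makes the procedure greedy: a split that looks poor now but would enable a very good split later is never considered.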
Learning a regression tree: an example
Learning a classification tree: prediction and objective
Building a tree for classification is conceptually similar. We need to change the predictions and the objective function.
The prediction for each region is the most common class in that region,
$\hat{y}_1(j,s) = \mathrm{MajorityVote}\{y_i : i \in R_1(j,s)\}$ and $\hat{y}_2(j,s) = \mathrm{MajorityVote}\{y_i : i \in R_2(j,s)\}$
The objective function is
$n_1 Q_1 + n_2 Q_2$,
where $n_1$ and $n_2$ are the numbers of data points in $R_1$ and $R_2$, and $Q_l$ could be one of:
$Q_l = 1 - \max_m \hat{\pi}_{l,m}$ [misclassification rate]
$Q_l = \sum_{m=1}^{M} \hat{\pi}_{l,m} (1 - \hat{\pi}_{l,m})$ [Gini index]
$Q_l = -\sum_{m=1}^{M} \hat{\pi}_{l,m} \log \hat{\pi}_{l,m}$ [entropy]
where $\hat{\pi}_{l,m}$ is the proportion of class m in region l and M is the number of classes.
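The three impurity measures and the split objective can be written down directly from these definitions. The sketch below is illustrative: the function names are assumptions, and the entropy uses the natural logarithm (a different base only rescales it).

```python
from collections import Counter
from math import log


def proportions(labels):
    """Class proportions for the labels that fall in one region."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def misclassification(labels):
    return 1.0 - max(proportions(labels))


def gini(labels):
    return sum(p * (1.0 - p) for p in proportions(labels))


def entropy(labels):
    # Classes absent from the region never appear in the Counter, so log(0) cannot occur.
    return -sum(p * log(p) for p in proportions(labels))


def split_objective(left, right, impurity):
    """n1 * Q1 + n2 * Q2 for the labels in the two regions of a candidate split."""
    return len(left) * impurity(left) + len(right) * impurity(right)


mixed = ["a", "a", "b", "b"]   # maximally mixed for two classes
pure = ["a", "a", "a", "a"]    # a single class
for q in (misclassification, gini, entropy):
    print(q.__name__, q(mixed), q(pure))
```

All three measures are zero for a pure region and largest when the classes are evenly mixed, which is why minimising the weighted sum favours splits that produce purer regions.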
Learning a classification tree: an example
Example
Goal: learn a tree using the entropy criterion for splitting, until the number of data points in each leaf node is less than 5.
Example
Consider 9 potential splits for the root node (dashed lines). We will consider each in turn and compute the objective function.
Consider the split $x_1 < 2.5$:
Region $R_1$ ($x_1 < 2.5$): two blues and one red
Region $R_2$ ($x_1 \geq 2.5$): three blues and four reds
Entropy for each region (using the natural logarithm):
$Q_1 = -\hat{\pi}_{1B} \log \hat{\pi}_{1B} - \hat{\pi}_{1R} \log \hat{\pi}_{1R} = -\frac{2}{3} \log \frac{2}{3} - \frac{1}{3} \log \frac{1}{3} \approx 0.637$
$Q_2 = -\hat{\pi}_{2B} \log \hat{\pi}_{2B} - \hat{\pi}_{2R} \log \hat{\pi}_{2R} = -\frac{3}{7} \log \frac{3}{7} - \frac{4}{7} \log \frac{4}{7} \approx 0.683$
Objective function for the $x_1 < 2.5$ split:
$n_1 Q_1 + n_2 Q_2 = 3 \times 0.637 + 7 \times 0.683 \approx 6.69$
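The arithmetic for this candidate split can be recomputed from the class counts alone. The helper name `entropy_from_counts` is an assumption for this sketch; the counts (two blues and one red, three blues and four reds) are those stated above, and the natural logarithm is used.

```python
from math import log


def entropy_from_counts(counts):
    """Entropy of a region given the number of points in each class."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)


q1 = entropy_from_counts([2, 1])   # R1: two blues, one red
q2 = entropy_from_counts([3, 4])   # R2: three blues, four reds
objective = 3 * q1 + 7 * q2        # n1 * Q1 + n2 * Q2

print(round(q1, 2), round(q2, 2), round(objective, 2))  # 0.64 0.68 6.69
```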
Example (continued)
Repeating the above for all 9 potential splits gives the objective value of each candidate. We then select the feature and threshold with the lowest objective and perform the split.
We repeat the same procedure for $R_2$, noting that all data points in $R_1$ already belong to the same class and the number of data points in $R_1$ is below the stopping threshold of 5.
Again we select a feature and threshold and perform a split. Done and dusted!
Algorithms for decision trees
Classification and Regression trees (CART):
what we have learnt in this lecture
split objective: squared errors or Gini index
handle both categorical and continuous features
binary decision at each node
Iterative Dichotomiser 3 (ID3):
split objective: entropy (what we have learnt) or information gain
handle only categorical features
multi-way decision at each node
C4.5, successor of ID3:
split objective: information gain ratio (avoids favouring features with many possible values)
handle both continuous and categorical features
multi-way decision at each node
bespoke pruning strategy
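To illustrate why the gain ratio discourages splits on features with many possible values, here is a small sketch for a multi-way split on a categorical feature. The function names and the toy data are assumptions. The gain ratio is the information gain divided by the split information, i.e. the entropy of the partition sizes; the natural logarithm is used, and the ratio does not depend on the base.

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def gain_ratio(feature_values, labels):
    """Information gain of a multi-way split divided by its split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)           # one branch per feature value
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = -sum(len(g) / n * log(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


labels = ["a", "a", "b", "b"]
binary_feature = [0, 0, 1, 1]   # two values, separates the classes
id_feature = [1, 2, 3, 4]       # unique value per point, also separates the classes

print(gain_ratio(binary_feature, labels))
print(gain_ratio(id_feature, labels))
```

Both features achieve the same information gain here, since every branch is pure, but the identifier-like feature is penalised by its larger split information and ends up with half the gain ratio.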
Overfitting and pruning
The impact of the depth:
Shallow tree: makes more mistakes on the training set, potentially underfits
Deep tree: makes fewer mistakes on the training set, potentially overfits
We can use a validation set to select hyperparameters such as the maximum depth, i.e. when to stop.
We can also grow the tree until it overfits and then gradually remove leaf nodes. This is called pruning.
Starting from the fully-grown tree $T_0$, the pruning objective for a sub-tree $T$ is
$C(T) = \lambda |T| + \sum_{i=1}^{|T|} (\text{error at leaf node } i)$
where $|T|$ is the number of leaf nodes of $T$ and $\lambda$ is a regularisation parameter. We want to find a sub-tree that minimises this pruning objective.
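A small sketch of how the pruning objective trades tree size against training error. The nested-dictionary tree representation, the `error` key and the numbers are all hypothetical and chosen only to make the trade-off visible.

```python
def leaves(tree):
    """Yield the leaf nodes of a tree; a leaf is a dict without a 'left' child."""
    if "left" not in tree:
        yield tree
    else:
        yield from leaves(tree["left"])
        yield from leaves(tree["right"])


def pruning_cost(tree, lam):
    """C(T) = lam * |T| + sum of the errors at the leaf nodes of T."""
    leaf_list = list(leaves(tree))
    return lam * len(leaf_list) + sum(leaf["error"] for leaf in leaf_list)


# Hypothetical training errors: the full tree has three leaves; the pruned
# tree collapses the right-hand split into a single leaf with a larger error.
full = {"left": {"error": 1.0},
        "right": {"left": {"error": 0.5}, "right": {"error": 0.5}}}
pruned = {"left": {"error": 1.0}, "right": {"error": 2.0}}

for lam in (0.5, 2.0):
    print(lam, pruning_cost(full, lam), pruning_cost(pruned, lam))
# 0.5 3.5 4.0   -> the full tree has the lower cost
# 2.0 8.0 7.0   -> the pruned tree has the lower cost
```

With a small λ the extra leaf pays for itself through the lower error; with a large λ the penalty on the number of leaves dominates and the smaller sub-tree wins.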
Pros and cons
Advantages:
fast to train and predict
easy to interpret
work well for categorical inputs with well defined thresholds
Disadvantages:
sensitive to changes in the data set, i.e. a small change in the training data can result in a very different split
axis-aligned decision boundaries. An example: two classes, each living on one side of the line x1 = x2, cannot be separated by a single split and need many splits to approximate the diagonal boundary.
piece-wise constant prediction for regression → non-smooth
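The axis-aligned limitation can be checked on a toy data set. The sketch below uses an assumed 10 × 10 integer grid with the diagonal removed, labels each point by which side of the line x1 = x2 it lies on, and searches over every single axis-aligned split with a majority-vote prediction on each side.

```python
from itertools import product

# Grid points off the diagonal, labelled by which side of x1 = x2 they lie on.
points = [(a, b) for a, b in product(range(10), repeat=2) if a != b]
labels = [a > b for a, b in points]


def stump_errors(j, s):
    """Training errors of a majority-vote split on feature j at threshold s."""
    errors = 0
    for side in (True, False):
        region = [y for p, y in zip(points, labels) if (p[j] < s) == side]
        if region:
            positives = sum(region)
            errors += min(positives, len(region) - positives)  # minority count
    return errors


# Try every threshold between consecutive grid values, for both features.
best = min(stump_errors(j, s + 0.5) for j in range(2) for s in range(9))
print(best, "of", len(points), "points misclassified by the best single split")

# The oblique rule x1 > x2 separates the classes perfectly by construction.
oblique_errors = sum((a > b) != y for (a, b), y in zip(points, labels))
print(oblique_errors)  # 0
```

No single threshold on x1 or x2 comes close to the diagonal rule, so a tree has to stack many splits into a staircase to approximate it.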
Recap
We discussed decision trees:
binary trees
how to perform prediction
how to grow a tree, with different criteria for regression and classification
In two weeks: we will discuss Bagging and Boosting.
Thank you!