Lecture 5 Decision trees

1 / 30

Recap

In the last lecture, we discussed:

Model selection: hold-out, cross-validation, hyperparameter tuning

Decision theory for classification

Clustering: k-means, mixture of Gaussians

Examples: Covid-19 forecasting [Google Cloud and Harvard Global Health Institute]: https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-is-releasing-the-covid-19-public-forecasts ; Covid-19 mobility report [Google]: https://www.google.com/covid19/mobility/

2 / 30

Outline

1 Intro

2 Prediction

3 Learning or growing a decision tree

4 Practical issues

5 Pros and cons

3 / 30

An example

Imagine that we want to partition the input space into axis-aligned, box-like areas [left], where each box is associated with an outcome. We can equivalently represent this using a binary tree [right].

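[Figure. Left: a two-dimensional input space (axes x1, x2) partitioned by the thresholds θ1, θ2, θ3, θ4 into regions A, B, C, D, E. Right: the corresponding binary tree with node tests x1 > θ1, x2 > θ3, x1 ≤ θ4, x2 ≤ θ2 and leaves A, B, C, D, E. Source: Bishop, PRML book, Chapter 14.]

4 / 30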

Introduction

Many decision-making systems use a set of if-then-else rules, mimicking how many of us often make decisions:

easy to interpret and explain

Decision trees aim to learn a set of if-else rules from data

learn the structure of the tree, aka how to partition the input space

specify a desirable output for each partition/region

Decision trees are very popular in practice. Two algorithms — C4.5 (successor of ID3) and CART — are in the top 10 in this 2007 survey https://link.springer.com/article/10.1007/s10115-007-0114-2.

5 / 30

An example: data

6 / 30

An example: decision tree

7 / 30

Terminologies and simplifying assumptions

We will consider binary trees. They are formed by nodes:

each node (except leaf nodes) is associated with a rule, “if x_d > a” or “if x_d < b”, where x_d is the d-th feature of the input.

root node is where we start traversing the tree

leaf node is associated with an output, used for prediction

Each decision node tests a single feature/covariate and makes a binary decision.

Lines that connect nodes are called branches.

The outputs at the leaf nodes can be categorical (for classification) or continuous (for regression).

8 / 30

Predicting using a decision tree

Given a tree and a new input, we traverse down the tree and use the output at the resulting leaf node as our prediction
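
The traversal can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the `Node` class, its field names, the convention that the left branch is taken when `x[feature] < threshold`, and the example tree are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class Node:
    # Decision node: holds a test "x[feature] < threshold" and two children.
    # Leaf node: only `output` is set.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # taken when x[feature] < threshold
    right: Optional["Node"] = None   # taken when x[feature] >= threshold
    output: Union[float, str, None] = None


def predict(node: Node, x: Sequence[float]):
    """Walk from the root down to a leaf and return the leaf's output."""
    while node.output is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.output


# Hypothetical tree: split on x[0] at 2.5, then on x[1] at 1.0.
tree = Node(
    feature=0, threshold=2.5,
    left=Node(output="blue"),
    right=Node(feature=1, threshold=1.0,
               left=Node(output="red"), right=Node(output="blue")),
)

print(predict(tree, [3.0, 0.5]))
```

The cost of a prediction is one comparison per level, so it grows with the depth of the tree rather than with the number of training points.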

10 / 30

Learning a regression tree: finding ŷ and tree structure

We start with regression. Given a tree whose leaf nodes correspond to input regions, the prediction for region $R_l$ is

$$\hat{y}_l = \operatorname{Average}\{y_i : i \in R_l\},$$

i.e. prediction = average of all training outputs in the region.

We will next discuss how to find the regions $\{R_l\}_{l=1}^{L}$: the predictions at the leaf nodes must be similar to the training data, i.e. we still wish to minimise some loss function.

searching over all possible binary trees is computationally intractable, in general.

we opt for a greedy procedure to grow the tree: consider one split at a time.

12 / 30

Learning a regression tree: greedy procedure

We consider one split at a time. The objective is to obtain a model that explains the data as well as possible after a single split. There is no guarantee that this split is globally optimal once all other splits are taken into account.

At the root node, we want to select a feature j from the p features and a threshold s that divide the input space into two half-spaces:

$$R_1(j,s) = \{x : x_j < s\} \quad\text{and}\quad R_2(j,s) = \{x : x_j \ge s\}$$

Predictions associated with these regions:

$$\hat{y}_1(j,s) = \operatorname{Average}\{y_i : i \in R_1(j,s)\} \quad\text{and}\quad \hat{y}_2(j,s) = \operatorname{Average}\{y_i : i \in R_2(j,s)\}$$

Similar to other models, we can write down the loss function:

$$\sum_{i \in R_1(j,s)} \bigl(y_i - \hat{y}_1(j,s)\bigr)^2 + \sum_{i \in R_2(j,s)} \bigl(y_i - \hat{y}_2(j,s)\bigr)^2$$
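
As a sketch, the loss of a single candidate split can be computed as follows. The function names (`sse`, `split_loss`) and the toy data are assumptions made for illustration; the convention that an empty region contributes zero loss is also an assumption.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty region)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_loss(X, y, j, s):
    """Squared-error loss after splitting on feature j at threshold s."""
    left = [yi for xi, yi in zip(X, y) if xi[j] < s]     # outputs in R1(j, s)
    right = [yi for xi, yi in zip(X, y) if xi[j] >= s]   # outputs in R2(j, s)
    return sse(left) + sse(right)


# Made-up data: one feature, outputs jump between x = 2 and x = 3.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.0, 1.0, 5.0, 5.0]

print(split_loss(X, y, j=0, s=2.5))  # split at the jump
print(split_loss(X, y, j=0, s=1.5))  # split away from the jump
```

A threshold placed at the jump separates the two groups of outputs exactly, so its loss is lower than that of a threshold placed elsewhere.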

13 / 30

Learning a regression tree: greedy procedure (continued)

Remember that we need to find the feature j and the corresponding threshold s that minimise the one-step objective function.

We can loop through all features, i.e. set j = 1, ..., p, and the available thresholds for each feature, and pick the pair (j,s) that minimises the loss.

After this step, the data is divided into two regions R1 and R2. We then apply the same splitting procedure to the data in each region, so each is divided in two. We then apply the same procedure to each sub-region, and so on.

So, when do we stop?

when each prediction exactly matches the training points? this would probably overfit

when a criterion is met, e.g. maximum depth of the tree or minimum number of training points in each leaf node
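
The whole procedure, including the stopping criteria, can be sketched recursively. This is an illustrative sketch under stated assumptions, not a reference implementation: candidate thresholds are taken to be midpoints between consecutive distinct feature values, the tree is stored as nested dictionaries, and the parameter names `max_depth` and `min_split` (the smallest node we still try to split) are hypothetical.

```python
def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)


def best_split(X, y):
    """Return (loss, j, s) minimising the one-step loss, or None if no split exists."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            loss = sse(left) + sse(right)
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best


def grow(X, y, depth=0, max_depth=2, min_split=2):
    """Grow a regression tree greedily; leaves store the average output."""
    split = None
    if depth < max_depth and len(y) >= min_split:
        split = best_split(X, y)
    if split is None:  # stopping criterion met, or no valid split left
        return {"output": sum(y) / len(y)}
    _, j, s = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] < s]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] >= s]
    return {
        "feature": j,
        "threshold": s,
        "left": grow([r for r, _ in left], [v for _, v in left],
                     depth + 1, max_depth, min_split),
        "right": grow([r for r, _ in right], [v for _, v in right],
                      depth + 1, max_depth, min_split),
    }


# Made-up data: one feature, outputs jump between x = 2 and x = 3.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.0, 1.0, 5.0, 5.0]
tree = grow(X, y, max_depth=1)
print(tree)
```

Raising `max_depth` or lowering `min_split` lets the tree fit the training data more closely, which is where the risk of overfitting comes from.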

14 / 30

Learning a regression tree: an example

15 / 30

Learning a classification tree: prediction and objective

Building a tree for classification is conceptually similar. We need to change the predictions and the objective function.

The prediction for each region is the most common class in that region,

$$\hat{y}_1(j,s) = \operatorname{MajorityVote}\{y_i : i \in R_1(j,s)\}, \qquad \hat{y}_2(j,s) = \operatorname{MajorityVote}\{y_i : i \in R_2(j,s)\}$$

The objective function is

$$n_1 Q_1 + n_2 Q_2,$$

where $n_1$ and $n_2$ are the numbers of data points in $R_1$ and $R_2$, and $Q_l$ can be one of:

$$Q_l = \begin{cases} 1 - \max_m \hat{\pi}_{l,m} & \text{[misclassification rate]} \\ \sum_{m=1}^{M} \hat{\pi}_{l,m}\,(1 - \hat{\pi}_{l,m}) & \text{[Gini index]} \\ -\sum_{m=1}^{M} \hat{\pi}_{l,m} \log \hat{\pi}_{l,m} & \text{[entropy]} \end{cases}$$

where $\hat{\pi}_{l,m}$ is the proportion of class m in region l.
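
The three impurity measures and the split objective can be sketched as below. The function names and the use of the natural logarithm for the entropy are assumptions made for this illustration; a different log base only rescales the entropy.

```python
import math
from collections import Counter


def proportions(labels):
    """Class proportions in a region (only classes that actually occur)."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]


def misclassification(labels):
    return 1.0 - max(proportions(labels))


def gini(labels):
    return sum(p * (1.0 - p) for p in proportions(labels))


def entropy(labels):
    # Natural logarithm; absent classes never appear, so log(0) cannot occur.
    return -sum(p * math.log(p) for p in proportions(labels))


def objective(left_labels, right_labels, impurity=gini):
    """n1 * Q1 + n2 * Q2 for a candidate split."""
    return (len(left_labels) * impurity(left_labels)
            + len(right_labels) * impurity(right_labels))


# Made-up region with two classes.
region = ["B", "B", "R"]
print(misclassification(region), gini(region), entropy(region))
```

All three measures are zero for a region containing a single class and largest when the classes are evenly mixed.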

17 / 30

Learning a classification tree: an example

18 / 30

Example

Goal: learn a tree using the entropy criterion for splitting, and keep splitting until the number of data points in each leaf node is less than 5.

19 / 30

Example

Consider 9 potential splits for the root node (dashed lines). We will consider each in turn and compute the objective function.

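Consider the split x1 < 2.5:

Region R1 (x1 < 2.5): two blues and one red

Region R2 (x1 ≥ 2.5): three blues and four reds

Entropy for each region (using the natural logarithm):

$$Q_1 = -\hat{\pi}_{1B} \log \hat{\pi}_{1B} - \hat{\pi}_{1R} \log \hat{\pi}_{1R} = -\tfrac{2}{3} \log \tfrac{2}{3} - \tfrac{1}{3} \log \tfrac{1}{3} \approx 0.64$$

$$Q_2 = -\hat{\pi}_{2B} \log \hat{\pi}_{2B} - \hat{\pi}_{2R} \log \hat{\pi}_{2R} = -\tfrac{3}{7} \log \tfrac{3}{7} - \tfrac{4}{7} \log \tfrac{4}{7} \approx 0.68$$

Objective function for the x1 < 2.5 split:

$$n_1 Q_1 + n_2 Q_2 = 3 \times 0.637 + 7 \times 0.683 \approx 6.69$$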
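
The arithmetic for this split can be checked with a short script. It is a sketch that recomputes the entropies directly from the class counts stated above, assuming the natural logarithm; the helper name `entropy_from_counts` is made up for this illustration.

```python
import math


def entropy_from_counts(counts):
    """Entropy (natural log) of a region given its per-class counts."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)


q1 = entropy_from_counts([2, 1])   # R1: two blues, one red
q2 = entropy_from_counts([3, 4])   # R2: three blues, four reds
objective = 3 * q1 + 7 * q2        # n1 * Q1 + n2 * Q2

print(round(q1, 2), round(q2, 2), round(objective, 2))
```

Repeating the same computation for each of the other candidate splits and taking the smallest value selects the root split.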

20 / 30

Example (continued)

Repeating the above for all 9 potential splits gives,

21 / 30

Example (continued)

... we select a feature + threshold and perform a split:

22 / 30

Example (continued)

We repeat the same procedure for R2, noting that all data points in R1 already belong to the same class and that the number of data points in R1 is already below the stopping threshold of 5, so R1 is not split further.

23 / 30

Example (continued)

... we select a feature + threshold and perform a split:

Done and dusted!

24 / 30

Algorithms for decision trees

Classification and Regression trees (CART):

what we have learnt in this lecture

split objective: squared errors or Gini index

handle both categorical and continuous features

binary decision at each node

Iterative Dichotomiser 3 (ID3):

split objective: entropy (what we have learnt) or information gain

handle only categorical features

multi-way decision at each node

C4.5, successor of ID3:

split objective: information gain ratio (avoids favouring splits on features with many potential outcomes)

handle both continuous and categorical features

multi-way decision at each node

bespoke pruning strategy

25 / 30

Outline

1 Intro

2 Prediction

3 Learning or growing a decision tree

4 Practical issues

5 Pros and cons

26 / 30

Overfitting and pruning

The impact of the depth:

Shallow tree: makes more mistakes on the training set, potentially underfits

Deep tree: makes fewer mistakes on the training set, potentially overfits

We can use a validation set to select hyperparameters such as the maximum depth, i.e. to decide when to stop.

We can also grow the tree until it overfits and then gradually remove leaf nodes. This is called pruning.

Consider the fully-grown tree T0. The pruning objective for a sub-tree T is

$$C(T) = \lambda |T| + \sum_{i=1}^{|T|} \text{error at leaf node } i$$

where |T| is the number of leaf nodes of T and λ is a regularisation parameter. We want to find a subtree that minimises this pruning objective.
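
To see how λ trades off tree size against training error, the objective can be evaluated for two candidate trees. This is a toy sketch: the per-leaf errors below are made-up numbers, and `pruning_cost` is a hypothetical helper, not a routine from any library.

```python
def pruning_cost(leaf_errors, lam):
    """C(T) = lam * |T| + sum of the errors at the leaf nodes of T."""
    return lam * len(leaf_errors) + sum(leaf_errors)


# Hypothetical per-leaf errors.
full_tree = [0.0, 1.0, 0.0, 1.0]   # four leaves, total error 2
pruned = [1.0, 2.0]                # two leaves, total error 3

for lam in (0.1, 1.0):
    print(lam, pruning_cost(full_tree, lam), pruning_cost(pruned, lam))
```

With a small λ the larger tree has the lower cost; with a larger λ the penalty on the number of leaves dominates and the pruned tree is preferred. In practice λ can be chosen on a validation set, like the other hyperparameters.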

27 / 30

Pros and cons

Advantages:

fast to train and predict

easy to interpret

work well for categorical inputs with well defined thresholds

Disadvantages:

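sensitive to changes in the data set, i.e. a small change in the training data can result in a very different split

axis-aligned decision boundaries. An example: two classes, each living in one half-space on either side of the line x1 = x2. No single axis-aligned split separates them, so the tree has to approximate the diagonal boundary with many small steps.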

piece-wise constant prediction for regression → non-smooth

29 / 30

Recap

We discussed decision trees:

binary trees

how to perform prediction

how to grow a tree, with different criteria for regression and classification

In two weeks: we will discuss Bagging and Boosting.

Thank you!

30 / 30
