regression using R

profilefinal
ReferenceforAssignment.pdf

Review: training and test error

We can evaluate the prediction error of our model either on the

training dataset (known as training error) or on a dataset not used

for training the model (known as test error).

When training (fitting) a supervised learning model, the algorithm’s

goal is to minimize training error.

However, what we really want to do is to minimize the test error, or

the prediction error on data we have not observed yet.

Overfitting is an issue, because we can begin modeling noise in the

training data rather than only the underlying function.

This can lead to low training error, but high test error.

Review: illustration of training vs. test error

Below is prediction error as a function of polynomial degree 𝑀 for both training and test datasets:

Training error always decreases with increasing model complexity

(flexibility), whereas test error has a U-shaped curve.

Bishop (2006) Pattern Recognition and Machine Learning. Springer.

Prediction

error

How do we

identify 𝑴 such that the inferred

model has good

test error?

How do we minimize test error?

We have two main goals:

1) Model selection – comparing the performance of different

models to select the best one

2) Model assessment – estimating test error of the selected model

If we have a lot of data, then the optimal approach for both of these is

to divide the dataset uniformly at random into three parts:

Training set

Validation set

Test set

Training, validation, and test sets

Training set: used for fitting the models

Validation set: used to estimate prediction error for model selection

Test set: used for assessing prediction (test) error on new data

The test set should only be used at the end of the analysis for model

assessment, and not for choosing the best model.

There is no general rule for how to split the dataset, but a typical split

is 50% for training, 25% for validation, and 25% for test.

Hastie et al (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd Ed.

Do we need to split data into three parts?

For many datasets it may not be possible.

That is, there may not be enough data to create a separate test

dataset without decreasing the sample size such that it is deleterious

to model selection.

Model selection is critical, and often our primary concern is training a

method that will perform as well as possible on unseen data.

We will therefore consider methods that split the dataset into only

training and validation sets.

In these methods, the validation set is used for estimating test error.

The validation (hold-out) set approach

The validation (or hold-out) set approach involves dividing the training

dataset uniformly at random into roughly two equally-sized parts: the

training set and the validation set (known as the hold-out set).

The model is rained on the traning se

The model is trained on the training set (blue), and the test error of

the trained model is estimated from the validation set (orange).

Models of different complexity can be trained and their test error

evaluated, and then the best-performing model can be selected.

James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

Validation set approach on polynomial regression

Validation (test) error for a single split into training and validation sets:

James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

Validation set approach on polynomial regression

Validation (test) error for different splits of the data:

Test error is highly variable across splits, and depends on which

observations are included in the training vs. validation sets.

This variation occurs because the validation set approach trains the

model on only half of the data (small sample).

Methods tend to perform worse when trained on fewer observations,

suggesting validation error overestimates test error on entire dataset. James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

Overcoming pitfalls of the validation set approach

We need the learned model to be more robust to variability in the

choice of training and validation set.

We also want the trained model to be based on a large sample size.

The solution to this problem is cross-validation (CV), which involves

repeatedly splitting the data into training and test sets, and then

assessing mean performance across iterations.

There are tow major CV approaches:

1) Leave-one-out cross-validation (LOOCV)

2) 𝑘-fold cross validation

Leave-one-out cross-validation (LOOCV)

James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

Leave-one-out cross-validation (𝐋𝐎𝐎𝐂𝐕) splits the data into a training set of 𝑛 − 1 observations and a validation set of the remaining 1 observation:

Leave-one-out cross-validation (LOOCV)

We start by removing training observation (𝑥1,𝑦1), and using only observations {(𝑥2,𝑦2),(𝑥3,𝑦3),…,(𝑥𝑛,𝑦𝑛)} for training.

The model is fit to the training data, and a prediction is made for

training observation (𝑥1,𝑦1).

This validation (test) error of training observation (𝑥1,𝑦1) is:

MSE1 = 𝑦1 − ො𝑦1 2

This estimate of the MSE of the method will be highly variable

because it is based on only a single observation.

Leave-one-out cross-validation (LOOCV)

We repeat this process for each of the 𝑛 training observations, such that we remove training observation (𝑥𝑖,𝑦𝑖), and use observations { 𝑥1,𝑦1 ,…, 𝑥𝑖−1,𝑦𝑖−1 , (𝑥𝑖+1,𝑦𝑖+1),…,(𝑥𝑛,𝑦𝑛)} for training.

The model is fit to the training data, and a prediction is made for

training observation (𝑥𝑖,𝑦𝑖).

The validation (test) error of training observation (𝑥𝑖,𝑦𝑖) is:

MSE𝑖 = 𝑦𝑖 − ො𝑦𝑖 2

When we have computed MSE for all 𝑛 training observations, we compute the LOOCV validation (test) error as:

CV(𝑛) = MSE1 + MSE2 + ⋯+ MSE𝑛

𝑛

Advantages of LOOCV over validation set

LOOCV has substantially less bias because the model is repeatedly

trained on a large sample set of size 𝑛 − 1 training observations, whereas the validation set approach trains models with a sample size

of approximately 𝑛/2 training observations.

Because of this, LOOCV tends to not overestimate the test error.

Moreover, LOOCV always yields the same test error and inferred

model, compared to variation attributed to different splits of the data

with the validation set approach.

However, LOOCV is computationally expensive, because the model

needs to be trained 𝑛 times.

LOOCV approach on polynomial regression

James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

𝑘-fold CV

James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

𝒌-fold cross-validation (𝒌−fold 𝐂𝐕) randomly splits data into 𝑘 “folds”, with a training set of 𝑘 − 1 folds and a test set of the remaining 1 fold.

𝑘-fold cross-validation (CV)

This process is repeated for each set (just like with LOOCV):

When set 𝑖 is held out for validation, the remaining 𝑘 − 1 sets (sets 1,…,𝑖 − 1,𝑖 + 1,…,𝑘) are used for training.

The MSE across the observations in the 𝑖th set is used to measure the validation (test) error, and is denoted by MSE𝑖, and the overall 𝑘- fold CV test error is computed as

CV(𝑘) = MSE1 + MSE2 + ⋯+ MSE𝑘

𝑘

Computational savings with 𝑘-fold CV

LOOCV is a special case of 𝑘-fold CV with 𝑘 = 𝑛.

Though LOOCV is superior to 𝑘-fold CV (𝑘 < 𝑛), it is also computationally expensive to train the model 𝑛 times.

In practice, we usually use 𝑘 = 5 or 𝑘 = 10 for 𝑘-fold CV.

This makes 𝑘-fold CV (𝑘 < 𝑛) much less computationally intensive than LOOCV.

In such cases, we only have to train 5 or 10 models, rather than 𝑛, which could lead to enormous savings for methods that are

computationally intensive to fit.

LOOCV vs. 10-fold CV on polynomial regression

James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

The overall performance is similar between LOOCV and 10-fold CV:

Though there is some variability across the 10-fold CV estimates, it is much lower than the variability from the validation set approach.

Non-computational advantages of 𝑘-fold CV (𝑘 < 𝑛)

Because LOOCV uses a greater number of observations in each

training set, its model estimates will be based on a larger sample

size, and should exhibit smaller bias relative to 𝑘-fold CV (𝑘 < 𝑛).

However, LOOCV has a higher variance than 𝑘-fold CV (𝑘 < 𝑛).

LOOCV is essentially averaging the output of 𝑛 fitted models, each of which is highly positively correlated with the others (as almost all

training observations overlap).

𝑘-fold CV is averaging the output of 𝑘 fitted models, each of which is partially correlated with the others.

Non-computational advantages of 𝑘-fold CV (𝑘 < 𝑛)

The mean of many highly correlated quantities has higher variance

than the mean of many quantities that are less correlated.

Because of this, the test error estimates resulting from 𝑘-fold CV (𝑘 < 𝑁) tend to be smaller than those from LOOCV.

Therefore, there is a bias-variance trade-off depending on the chosen

value of 𝑘 in 𝑘-fold CV.

For these reasons, 5- and 10-fold CV is often performed, as they

have been demonstrated empirically to provided estimates of the test

error that do not experience high bias or high variance.

Feature selection

A related problem to model selection is feature selection, which

involves determining which features are associated with the response

so that we can fit a model with only those features.

In feature selection, we seek to minimize test error, while also

minimizing the number of features used to make predictions.

After feature selection, chosen features would hopefully be the

important ones for making predictions about the response.

Irrelevant features not associated with the response are discarded.

Redundant (correlated) features are also discarded.

Why perform feature selection?

Feature selection can enhance interpretability of the fitted model

because fewer features are associated with the response.

The reduction of model parameters may lead to better generalization

or test error due to a reduction in overfitting.

Including fewer features can reduce the time it takes to train models,

as training time increases with increased dimensionality (more

complexity/ features).

Subset selection methods

We begin by considering subset selection methods.

These methods try to identify a subset of the 𝑝 features that will lead to reduced prediction error.

We will consider three approaches:

1) Best subset selection – compare all possible subsets of features

2) Forward selection – start with null model (no features) and add

3) Backward selection – start with full model (all features) and subtract

Best subset selection

1. Let ℳ0 denote the null model that contains no features, which is the model with only parameter 𝛽0.

2. For each 𝑘 = 1,2,…,𝑝:

(a) Fit all models of exactly 𝑘 features.

(b) Pick the best (smallest RSS) among these models, and call it

ℳ𝑘.

3. Select a single best model from {ℳ0,ℳ1,…ℳ𝑝} based on model

selection (e.g., cross-validation).

Adapted from James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

This is the ideal approach because we are comparing every

possible subset of features!

Illustration of best subset selection

RSS for dataset with 𝑝 = 8 features:

Choose model with lowest RSS among all sets of models.

The best model contains all 𝑝 = 8 features in the training set.

Hastie et al (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd Ed.

Problem with best subset selection

In best subset selection, we must perform least squares regression

for all subsets with 0,1,2,…,𝑝 features.

The set of all subsets is known as the power set, which is

exponentially increasing in size of the original set.

Because we have 𝑝 features, the total number of fitted models is 2𝑝, which is enormous, and computationally infeasible for anything but a

small number of features (maximum 𝑝 between 30 and 40 based on an efficient algorithm of Furnival and Wilson 1974).

For example, even when 𝑝 = 20 (small number of features for modern datasets), we must consider 220 = 1,048,576 models!

Furnival, Wilson (1974) Technometrics 16:499-511.

How can we overcome this computational burden?

Best subset selection is simply not computationally feasible for many

contemporary datasets.

The other two approaches that we will discuss examine a far more

restricted space compared to best subset selection:

2) Forward selection – start with null model (no features) and add

3) Backward selection – start with full model (all features) and subtract

Forward selection

1. Let ℳ0 denote the null model that contains no features, which is the model with only parameter 𝛽0.

2. For each 𝑘 = 0,1,…,𝑝 − 1:

(a) Fit all 𝑝 − 𝑘 models that augment the features in ℳ𝑘 with one additional feature.

(b) Pick the best (smallest RSS) among these 𝑝 − 𝑘 models, and call it ℳ𝑘+1.

3. Select a single best model from {ℳ0,ℳ1,…ℳ𝑝} based on model

selection (e.g., cross validation).

Adapted from James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

Backward selection

1. Let ℳ𝑝 denote the full model that contains all 𝑝 features.

2. For each 𝑘 = 𝑝,𝑝 − 1,…,1:

(a) Fit all 𝑘 models that have all but one of the features in ℳ𝑘, for a total of 𝑘 − 1 features.

(b) Pick the best (smallest RSS) among these 𝑘 models, and call it ℳ𝑘−1.

3. Select a single best model from {ℳ0,ℳ1,…ℳ𝑝} based on model

selection (e.g., cross validation).

Adapted from James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

Number of fitted models with forward & backward selection

The number of fitted models for either forward selection or backward

selection is:

1 + 𝑝 𝑝 + 1

2

Thus, they are equivalent in computational efficiency.

This number is also much smaller than the 2𝑝 models required by best subset selection.

For example, with 𝑝 = 20, best subset selection requires 1,048,576 fitted models, while forward and backward selection require only 211.

How to choose between forward & backward selection?

First, it is important to point out that both methods may not yield the

optimal solution, which can only be found by best subset selection.

The optimal solutions for forward and backward stepwise selection

also may differ.

Backward subset selection requires 𝑝 ≤ 𝑛, so that the full model can be fit without issue.

Forward subset selection can be used even when 𝑝 > 𝑛, as the algorithm can just stop at the subset of size 𝑘 = 𝑛 features.

However, forward selection is “greedy” and might include features

early that later become less important.

Can we get closer to best subset selection?

A hybrid approach between forward and backward stepwise selection

can be taken to get a better approximation to best subset selection.

With such an approach, features are added iteratively through

forward stepwise selection.

After each new feature is added, the method can remove any feature

from the model that no longer improves model fit.

Therefore, features added earlier in the process may be removed if

they are no longer important conditional on the set of added features

that were added later in the process.

This alleviates the “greedy” issue of forward selection.

Comparison of subset selection approaches

𝑁 = 30 observations with 𝑝 = 31 standard normal features with pairwise correlations of 0.85:

Adapted from Hastie et al (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd Ed.

Shrinkage methods as alternative to subset selection

Subset selection methods retain a subset of the features and discard

the remainder, often leading to a more interpretable model.

However, this discrete process of retaining or discarding features

often leads to high variance, and therefore does not reduce the

prediction error as much as hoped from the full model.

An alternative approach is to fit a model with all features, then shrink

estimated parameters toward zero relative to least squares estimates.

This can reduce variance of estimates, as well as potentially remove

(estimate parameters to be zero) unimportant features.

This approach is called regularization because it constrains, or

regularizes, parameter estimates.

Regularization techniques

Recall that in least squares, we estimate parameters 𝛽0 …𝛽𝑝 by

minimizing the cost function 𝐽 𝛽 :

𝐽 𝛽 = RSS = ෍

𝑖=1

𝑛

(𝑦𝑖 − ො𝑦𝑖) 2

Regularization shrinks estimates by minimizing a penalized variation.

We will discuss three common techniques:

1) Ridge regression

2) Lasso

3) Elastic net

Regularization techniques

Recall that in least squares, we estimate parameters 𝛽0 …𝛽𝑝 by

minimizing the cost function 𝐽 𝛽 :

𝐽 𝛽 = RSS = ෍

𝑖=1

𝑛

(𝑦𝑖 − ො𝑦𝑖) 2

Regularization shrinks estimates by minimizing a penalized variation.

We will discuss three common techniques:

1) Ridge regression

2) Lasso

3) Elastic net

Ridge regression

Recall that in least squares, we estimate parameters 𝛽0 …𝛽𝑝 by

minimizing the cost function 𝐽 𝛽 :

𝐽 𝛽 = RSS = ෍

𝑖=1

𝑛

(𝑦𝑖 − ො𝑦𝑖) 2

Ridge regression instead minimizes:

𝐽 𝛽,𝜆 = RSS + 𝜆 ෍

𝑗=1

𝑝

𝛽𝑗 2 ,

where the second term is a shrinkage (L2) penalty, and 𝜆 ≥ 0 is a tuning parameter that determines its influence on minimization.

How ridge regression works

Ridge regression minimizes:

𝐽 𝛽,𝜆 = RSS + 𝜆 ෍

𝑗=1

𝑝

𝛽𝑗 2 ,

where the second term is a shrinkage (L2) penalty, and 𝜆 ≥ 0 is a tuning parameter that determines its influence on minimization.

The shrinkage penalty is smallest when 𝛽1 …𝛽𝑝 are close to zero, thus

shrinking their estimates close to zero.

The shrinkage penalty has no influence when 𝜆 = 0 (least squares regression), and increasing influence as 𝜆 → ∞.

Choosing the tuning parameter 𝜆.

The original formulation of least squares regression minimizes the cost

function 𝐽(𝛽) = RSS, leading to a single set of parameter estimates.

However, ridge regression leads to different sets of estimates across

values of the tuning parameter 𝜆.

We choose the best set of parameter estimates under ridge regression

through cross-validation.

That is, we identify መ𝜆 through cross-validation that leads to the best መ𝛽.

We then estimate parameters based on this tuning parameter መ𝜆.

Example behavior of parameter estimates with increasing 𝜆

Parameter estimates (coefficients) for predicting credit card balance

from income, credit limit, credit rating, and student status as a

function of 𝜆 :

James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

Tuning parameter creates bias-variance tradeoff

The tuning parameter 𝜆 modifies model complexity.

That is, as the tuning parameter increases, model complexity

decreases, and we have fewer effective parameters.

This leads to a greater bias but a decreased variance.

The optimal 𝜆 value is chosen such that there is an optimal tradeoff between bias and variance, yielding improved test error.

Choosing the optimal value of the tuning parameter 𝜆

Depicted are MSEs (purple), squared bias (black), variance (green),

and irreducible error (dotted black) for different values of 𝜆:

James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

Optimal MSE (purple X)

occurs at intermediate

value of 𝜆

Advantages of ridge regression over other approaches

Least squares estimates have increasing variance as 𝑝 → 𝑛, and do not even have unique solutions when 𝑝 > 𝑛.

Ridge regression can still perform well when 𝑝 > 𝑛, trading small increases in bias for potentially larger decreases in variance.

Moreover, we must fit 2𝑝 models for best subset selection, and 1 + 𝑝(𝑝 + 1)/2 models for forward and backward selection.

In contrast, ridge regression only requires us to fit a single model for

any value of 𝜆, which can be done quickly.

But there is one obvious disadvantage to ridge regression relative to

all subset selection methods—it includes all 𝑝 features in the model.

Regularization techniques

Recall that in least squares, we estimate parameters 𝛽0 …𝛽𝑝 by

minimizing the cost function 𝐽 𝛽 :

𝐽 𝛽 = RSS = ෍

𝑖=1

𝑛

(𝑦𝑖 − ො𝑦𝑖) 2

Regularization shrinks estimates by minimizing a penalized variation.

We will discuss three common techniques:

1) Ridge regression

2) Lasso

3) Elastic net

Lasso is very similar in form to ridge regression

Ridge regression minimizes:

𝐽 𝛽,𝜆 = RSS + 𝜆 ෍

𝑗=1

𝑝

𝛽𝑗 2 ,

where the second term is a shrinkage (L2) penalty, and 𝜆 ≥ 0 is a tuning parameter that determines its influence on minimization.

Lasso instead minimizes:

𝐽 𝛽,𝜆 = RSS + 𝜆 ෍

𝑗=1

𝑝

𝛽𝑗 ,

where the second term is a shrinkage (L1) penalty, and 𝜆 ≥ 0 is a tuning parameter that determines its influence on minimization.

Lasso also works similarly to ridge regression

Lasso minimizes:

𝐽 𝛽,𝜆 = RSS + 𝜆 ෍

𝑗=1

𝑝

𝛽𝑗 ,

where the second term is a shrinkage (L1) penalty, and 𝜆 ≥ 0 is a tuning parameter that is selected by cross-validation.

As with ridge regression, the shrinkage penalty is smallest when 𝛽1 …𝛽𝑝 are close to zero, and has increasing influence as 𝜆 → ∞.

But unlike ridge regression, lasso can perform feature selection by

forcing some estimates to be exactly zero when is 𝜆 sufficiently large.

Tuning parameter creates bias-variance tradeoff

The tuning parameter 𝜆 modifies model complexity.

That is, as the tuning parameter increases, model complexity

decreases, and we have fewer effective parameters.

This leads to a greater bias but a decreased variance.

In addition, the decreased variance is associated with some features

becoming discarded as the tuning parameter increases in value.

The optimal 𝜆 value is chosen such that there is an optimal tradeoff between bias and variance, yielding improved test error.

Depicted are MSEs (purple), squared bias (black), and variance

(green) for different values of 𝜆:

Tuning parameter makes bias-variance tradeoff

James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.

Optimal MSE (purple X)

again occurs at

intermediate value of 𝜆

Ridge regression vs. lasso: Which is better?

Some of the same advantages of ridge regression also apply to lasso

(perform well when 𝑝 > 𝑛 unlike least squares, only need to fit a single model for each 𝜆 unlike subset selection methods).

Though ridge regression and lasso may outcompete these other

methods, neither one is always better than the other.

Ridge regression will generally perform better when the response is a

function of many features (because it must include all features).

Lasso will generally perform better when the response is a function of

few features (because it can perform feature selection).

However, we do not typically know how many features are related to the

response a priori for real data.

Comparison of approaches for prostate cancer prediction

Parameter estimates and test errors for least squares (LS), best

subset selection, ridge regression, and lasso:

Adapted from Hastie et al (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd Ed.

Comparison of approaches for prostate cancer prediction

Parameter estimates and test errors for least squares (LS), best

subset selection, ridge regression, and lasso:

Which approach is best? What does this say about the data?

Adapted from Hastie et al (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd Ed.

?

What if the best solution lies between ridge and lasso?

The best solution may not be the one determined by ridge regression,

or by lasso, but rather may be somewhere in between the two.

In such cases, combining ridge regression and lasso may increase

predictive ability.

This solution, termed elastic net, was proposed such that both ridge

regression and lasso are nested models within it.

Regularization techniques

Recall that in least squares, we estimate parameters 𝛽0 …𝛽𝑝 by

minimizing the cost function 𝐽 𝛽 :

𝐽 𝛽 = RSS = ෍

𝑖=1

𝑛

(𝑦𝑖 − ො𝑦𝑖) 2

Regularization shrinks estimates by minimizing a penalized variation.

We will discuss three common techniques:

1) Ridge regression

2) Lasso

3) Elastic net

Elastic net combines ridge regression and lasso

Ridge regression minimizes:

𝐽 𝛽,𝜆 = RSS + 𝜆 ෍

𝑗=1

𝑝

𝛽𝑗 2

Lasso minimizes:

𝐽 𝛽,𝜆 = RSS + 𝜆 ෍

𝑗=1

𝑝

𝛽𝑗

Elastic net minimizes:

𝐽 𝛽,𝜆,𝛼 = RSS + 𝜆 1 − 𝛼 ෍

𝑗=1

𝑝

𝛽𝑗 2 + 𝛼 ෍

𝑗=1

𝑝

𝛽𝑗 ,

where the second term is a shrinkage penalty, 𝜆 ≥ 0 is a tuning parameter that determines its influence on minimization, and 0 ≤ 𝛼 ≤ 1 is another tuning parameter that determines the influence of ridge

regression and lasso on the estimated parameters.

Elastic net works similarly to ridge regression and lasso

Elastic net minimizes:

𝐽 𝛽,𝜆,𝛼 = RSS + 𝜆 1 − 𝛼 ෍

𝑗=1

𝑝

𝛽𝑗 2 + 𝛼 ෍

𝑗=1

𝑝

𝛽𝑗 ,

where the second term is a shrinkage penalty, and 𝜆 ≥ 0 and 0 ≤ 𝛼 ≤ 1 are tuning parameters that are selected by cross-validation.

As with ridge regression and lasso, the shrinkage penalty is smallest

when 𝛽1 …𝛽𝑝 are close to zero, and has increasing influence as 𝜆 → ∞.

The tuning parameter 𝛼 determines the influence of ridge regression and lasso on the estimated parameters.

We obtain ridge regression when 𝛼 = 0, and lasso when 𝛼 = 1.

Comparison of approaches for prostate cancer prediction

Parameter predictions and test errors for ordinary least squares

(OLS), ridge regression, lasso, naïve elastic net, and elastic net

(improved performance, citation below):

Note that 𝑠 is the fraction of the L1 norm 𝛽𝑗 , which is in the L1

penalty term in lasso.

Zou and Hastie (2005) J. R. Statist. Soc. B. 67:301-320.

Cross-validated ridge and lasso regression in R

R of course has the machinery for us to easily perform ridge and

lasso regression together with cross-validation to determine the

appropriate tuning parameter 𝜆 (i.e., model complexity).

We first need to install and load the library for the glmnet package,

which provides an implementation of elastic net (and therefore lasso

and ridge) regression together with cross-validation.

To install and load glmnet, use the command:

> install.packages("glmnet")

> library(glmnet)

We will apply these methods to the Advertising dataset from our

lecture on linear regression.

Review: Advertising dataset

We will next consider the Advertising dataset, which can be

downloaded from https://www.statlearning.com/resources-first-edition.

This dataset is tidied into a data frame with 𝑛 = 200 market observations (rows) on 𝑝 = 4 features (columns):

sales Sales (thousands of units)

TV TV advertising budget (thousands of dollars)

radio Radio advertising budget (thousands of dollars)

newspaper Newspaper advertising budget (thousands of dollars)

Our goal is to predict sales from the other three features.

Creating the response and input feature matrices

The glmnet package requires that response and feature datasets be

provided separately and as matrices instead of data frames.

We can easily convert a data frame to a matrix with the as.matrix() function in R.

Using this function, we can obtain the response (𝑌) and feature (𝑋) matrices for use in the glmnet package:

> Y <- Advertising %>% select(sales) %>% as.matrix()

> X <- Advertising %>% select(TV, radio, newspaper) %>%

as.matrix()

The response and input feature matrices

We can use the head() function to peek at the first few rows of the

response and feature matrices:

> head(Y)

sales

[1,] 22.1

[2,] 10.4

[3,] 9.3

[4,] 18.5

[5,] 12.9

[6,] 7.2

> head(X)

TV radio newspaper

[1,] 230.1 37.8 69.2

[2,] 44.5 39.3 45.1

[3,] 17.2 45.9 69.3

[4,] 151.5 41.3 58.5

[5,] 180.8 10.8 58.4

[6,] 8.7 48.9 75.0

Specifying the set of tuning parameter (𝜆) values

Our goal is to identify the “best” 𝜆 and corresponding model through cross-validation.

Thus, we need to ensure that we explore a range of 𝜆 values that is wide enough to capture all important levels of model complexity.

We will consider 100 𝜆 values that span a range from 0.001 to 1000 on a logarithmic scale.

To do this, we will use the seq() function together with some

transformations:

> lambdas <- 10^seq(-3, 3, length.out = 100)

Obtaining ridge parameter estimates as a function of 𝜆

We can examine the set of all ridge regression parameter estimates as a function of the 𝜆 values in lambdas.

Specifically, we want to use ridge regression (𝛼 = 0) to estimate the four model parameters corresponding to the intercept and coefficients for TV, radio, and newspaper for each of the 100 𝜆 values.

To do this, we use the glmnet() function and set the argument

alpha (𝛼) to zero and provide the argument lambda (𝜆) our set of tuning parameter values lambdas:

> ridge.fit <- glmnet(X, Y, alpha = 0, lambda = lambdas)

Plotting parameter estimates as a function of 𝜆

The package glmnet comes with its own useful plotting functions.

We can view how parameter estimates shrink as 𝜆 increases:

> plot(ridge.fit, xvar = "lambda")

> legend("topright", lwd=1, col = 1:3, legend = colnames(X))

Using 𝑘-fold CV to fit a ridge regression model

We next wish to perform 𝑘 = 10-fold cross-validation as a function of the set of 𝜆 values in lambdas.

Specifically, we want to use ridge regression (𝛼 = 0) to calculate the 10-fold cross-validation error rate for each of the 100 𝜆 values.

To do this, we use the cv.glmnet() function and set the argument

alpha (𝛼) to zero, provide the argument lambda (𝜆) our set of tuning parameter values lambdas, and set the nfolds (𝑘) argument to 10:

> ridge.cv <- cv.glmnet(X, Y, alpha = 0, lambda = lambdas,

nfolds = 10)

Plotting the cross-validation error rate as a function of 𝜆

We can view the cross-validation error rate for ridge regression as a

function of 𝜆 with the command:

> plot(ridge.cv)

Numbers at the top show the number of selected features, which take

values of 0,1,…,𝑝 (ridge regression always retains all 𝑝 features).

Estimating the best model (smallest CV error rate)

We now would like to find the model with the smallest CV error rate.

To do so, we must first select the 𝜆 with the minimum CV error rate across the range of values that we explored.

Given our cross-validated object ridge.cv, we can access the 𝜆

with the minimum CV error rate using the command ridge.cv$lambda.min.

We can then find the best-fit ridge regression model by typing:

> ridge.best <- glmnet(X, Y, alpha = 0,

lambda = ridge.cv$lambda.min)

Obtaining parameter estimates

We can view the parameter estimates using the coef() function:

> coef(ridge.best)

4 x 1 sparse Matrix of class "dgCMatrix"

s0

(Intercept) 2.9906421059

TV 0.0455213515

radio 0.1874124145

newspaper -0.0007094929

This is almost identical to the least squares regression estimates:

> lm.fit

Call:

lm(formula = sales ~ TV + radio + newspaper, data = Advertising)

Coefficients:

(Intercept) TV radio newspaper

2.938889 0.045765 0.188530 -0.001037

Making predictions with the fitted model

Given our fitted model ridge.best, we can now make predictions

from new data with the predict() function in glmnet.

We can make predictions from the best-fit ridge regression model as:

> predict(ridge.best, as.matrix(tibble(TV = 1000,

radio = 2000, newspaper = 3000)))

s0

[1,] 421.2083

This is almost identical to our prediction with least squares:

> predict(lm.fit, tibble(TV = 1000, radio = 2000,

newspaper = 3000))

1

422.6511

Obtaining lasso parameter estimates as a function of 𝜆

The procedure for performing lasso regression in glmnet is akin to

ridge, except we set the argument alpha (𝛼) to one instead of zero:

> lasso.fit <- glmnet(X, Y, alpha = 1, lambda = lambdas)

> plot(lasso.fit, xvar = "lambda")

> legend("topright", lwd=1, col = 1:3, legend = colnames(X))

Using 𝑘-fold CV to fit a lasso regression model

We next wish to perform 𝑘 = 10-fold cross-validation as a function of the set of 𝜆 values in lambdas.

Specifically, we want to use lasso (𝛼 = 1) to calculate the 10-fold cross-validation error rate for each of the 100 𝜆 values.

To do this, we use the cv.glmnet() function and set the argument

alpha (𝛼) to one, provide the argument lambda (𝜆) our set of tuning parameter values lambdas, and set the nfolds (𝑘) argument to 10:

> lasso.cv <- cv.glmnet(X, Y, alpha = 1, lambda = lambdas,

nfolds = 10)

Plotting the cross-validation error rate as a function of 𝜆

We can view the cross-validation error rate of lasso as a function of 𝜆 with the command:

> plot(lasso.cv)

Recall that numbers at the top show the number of selected features,

which take values of 0,1,…,𝑝.

Estimating the best model (smallest CV error rate)

We now would like to find the model with the smallest CV error rate.

To do so, we must first select the 𝜆 with the minimum CV error rate across the range of values that we explored.

Given our cross-validated object ridge.cv, we can access the 𝜆

with the minimum CV error rate using the command ridge.cv$lambda.min.

We can then find the best-fit lasso regression model by typing:

> lasso.best <- glmnet(X, Y, alpha = 1,

lambda = lasso.cv$lambda.min)

Obtaining parameter estimates

We can view the parameter estimates using the coef() function:

> coef(lasso.best)

4 x 1 sparse Matrix of class "dgCMatrix"

s0

(Intercept) 3.07623499

TV 0.04520385

radio 0.18480819

newspaper .

These are similar to ridge regression with newspaper removed:

> coef(ridge.best)

4 x 1 sparse Matrix of class "dgCMatrix"

s0

(Intercept) 2.9906421059

TV 0.0455213515

radio 0.1874124145

newspaper -0.0007094929

Making predictions with the fitted model

Given our fitted model lasso.best, we can now make predictions

from new data with the predict() function in glmnet.

We can make predictions from the best-fit lasso regression model as:

> predict(lasso.best, as.matrix(tibble(TV = 1000,

radio = 2000, newspaper = 3000)))

s0

[1,] 417.8965

This is similar to our prediction from the best-fit ridge regression model:

> predict(ridge.best, as.matrix(tibble(TV = 1000,

radio = 2000, newspaper = 3000)))

s0

[1,] 421.2083

Application of regularization to classification problems

We can also apply penalization to logistic regression!

The ridge-penalized logistic regression cost function is:

𝐽 𝛽,𝜆 = −log𝐿(𝛽) + 𝜆 ෍

𝑗=1

𝑝

𝛽𝑗 2

The lasso-penalized logistic regression cost function is:

𝐽 𝛽,𝜆 = −log𝐿(𝛽) + 𝜆 ෍

𝑗=1

𝑝

𝛽𝑗

The elastic net-penalized logistic regression cost function is:

𝐽 𝛽,𝜆,𝛼 = −log𝐿(𝛽) + 𝜆 1 − 𝛼 ෍

𝑗=1

𝑝

𝛽𝑗 2 + 𝛼 ෍

𝑗=1

𝑝

𝛽𝑗

Cross-validated ridge and lasso logistic regression in R

Recall that R has the machinery for us to easily perform ridge and

lasso logistic regression together with cross-validation to determine

the appropriate tuning parameter 𝜆 (i.e., model complexity) using the elastic net model in the glmnet package.

To load glmnet, use the command:

> library(glmnet)

We will apply these methods to the Default dataset from our lecture

on classification with logistic regression.

Review: the Default dataset

The Default dataset is located in the ISLR package.

Simulated observations from 10,000 customers on four features:

default (Yes = defaulted, No = did not default)

student (Yes = student, No = not student)

balance (average balance after monthly payment)

income (annual income)

Our goal is to predict default from the other three features.

Cross-validated ridge and lasso logistic regression in R

Recall that to perform logistic regression, we coded the two default

classes (Yes and No) as 1 and 0:

> Default.recoded <- Default %>%

mutate(default = ifelse(default == "Yes", 1, 0),

student = ifelse(student == "Yes", 1, 0))

We will use these recoded data again for consistency, so that we can compare our results here to those obtained earlier.

Creating the response and input feature matrices

Recall that glmnet requires that response and feature datasets be

provided separately and as matrices instead of data frames.

We can easily convert a data frame to a matrix with the as.matrix() function in R.

Using this function, we can obtain the response (𝑌) and feature (𝑋) matrices for use in the glmnet package:

> Y <- Default.recoded %>% select(default) %>% as.matrix()

> X <- Default.recoded %>%

select(balance, income, student) %>%

as.matrix()

The response and input feature matrices

We can use the head() function to peek at the first few rows of the

response and feature matrices:

> head(Y)

default

[1,] 0

[2,] 0

[3,] 0

[4,] 0

[5,] 0

[6,] 0

> head(X)

balance income student

[1,] 729.5265 44361.625 0

[2,] 817.1804 12106.135 1

[3,] 1073.5492 31767.139 0

[4,] 529.2506 35704.494 0

[5,] 785.6559 38463.496 0

[6,] 919.5885 7491.559 1

Specifying the set of tuning parameter (𝜆) values

Recall that our goal is to identify the “best” 𝜆 and corresponding model through cross-validation.

Thus, we need to ensure that we explore a range of 𝜆 values that is wide enough to capture all important levels of model complexity.

We will consider 100 lambda values that span a range from 10−6 to 106 on a logarithmic scale.

To do this, we will use the seq() function together with some

transformations:

> lambdas <- 10^seq(-6, 6, length.out = 100)

Obtaining ridge parameter estimates as a function of 𝜆

We can examine the set of all ridge regression parameter estimates as a function of the 𝜆 values in lambdas.

Specifically, we want to use ridge regression (𝛼 = 0) to estimate the four model parameters corresponding to the intercept and coefficients for balance, income, and student for each of the 100 𝜆 values.

To do this, we use the glmnet() function and set the argument

family to "binomial" to indicate logistic regression, the argument

alpha (𝛼) to zero for ridge regression, and provide the argument lambda (𝜆) our set of tuning parameter values lambdas:

> ridge.fit <- glmnet(X, Y, family = "binomial", alpha = 0,

lambda = lambdas)

Plotting parameter estimates as a function of 𝜆

Recall that glmnet comes with its own useful plotting functions.

Let us view how parameter estimates shrink as 𝜆 increases:

> plot(ridge.fit, xvar = "lambda")

> legend("bottomright", lwd=1, col=1:3, legend = colnames(X))

Using 𝑘-fold CV to fit a ridge logistic regression model

We next wish to perform 𝑘 = 10-fold cross-validation as a function of the set of 𝜆 values in lambdas.

Specifically, we want to use ridge regression (𝛼 = 0) to calculate the 10-fold cross-validation error rate for each of the 100 𝜆 values.

To do this, we use the cv.glmnet() function and set the argument

family to "binomial" to indicate logistic regression, the argument

alpha (𝛼) to zero for ridge regression, provide the argument lambda (𝜆) our set of tuning parameter values lambdas, and set the nfolds

(𝑘) argument equal to 10:

> ridge.cv <- cv.glmnet(X, Y, family = "binomial", alpha = 0,

lambda = lambdas, nfolds = 10)

Plotting the cross-validation error rate as a function of 𝜆

We can view the cross-validation error rate for ridge logistic

regression as a function of 𝜆 with the command:

> plot(ridge.cv)

Numbers at top indicate the number of selected features, which take

values of 0,1,…,𝑝 (ridge regression always retrains all 𝑝 features).

Estimating the best model (simplest with smallest CV error)

We know how to find the model with the smallest CV error rate.

However, sometimes it is better to obtain the 𝜆 of the simplest model that is within one standard error of (i.e., not statistically different from)

the best model.

Given our cross-validated object ridge.cv, we can access this 𝜆 using the command ridge.cv$lambda.1se.

We can then find the best-fit, yet simple, ridge logistic regression

model by typing:

> ridge.1se <- glmnet(X, Y, family = "binomial", alpha = 0,

lambda = ridge.cv$lambda.1se)

Obtaining parameter estimates

We can view the parameter estimates using the coef() function:

> coef(ridge.1se)

4 x 1 sparse Matrix of class "dgCMatrix"

s0

(Intercept) -8.128411e+00

balance 3.844218e-03

income 4.968183e-06

student -2.279153e-01

These are similar to our logistic regression estimates:

> glm.mult.fit

Call: glm(formula = default ~ balance + income + student, family = "binomial",

data = Default.recoded)

Coefficients:

(Intercept) balance income student

-1.087e+01 5.737e-03 3.033e-06 -6.468e-01

Degrees of Freedom: 9999 Total (i.e. Null); 9996 Residual

Null Deviance: 2921

Residual Deviance: 1572 AIC: 1580

Predicting probabilities with the fitted model

Given our fitted model ridge.1se, we can now make predictions

from new data with the predict() function in glmnet.

We can predict probabilities of default with:

> predict(ridge.1se, as.matrix(tibble(balance = 1000,

income = 45000, student = 1)), type = "response",

s = ridge.cv$lambda.1se)

1

[1,] 0.01353914

This is about an order of magnitude larger than for logistic regression:

> predict(glm.mult.fit, tibble(balance = 1000,

income = 45000, student = 1), type = "response")

1

0.003530389

Predicting classes with the fitted model

We can also predict classes instead of probabilities by setting the

argument type to "class" instead of "response".

We can therefore find the predicted class as:

> predict(ridge.1se, as.matrix(tibble(balance = 1000,

income = 45000, student = 1)), type = "class",

s = ridge.cv$lambda.1se)

1

[1,] "0"

This is sensible as the predicted default probability associated with

this class < 0.5.

Adding predicted probabilities and classes to the data frame

We may also wish to add predicted class probabilities and classes to

our original data frame:

> estProbs <- predict(ridge.1se, X, type = "response",

s = ridge.cv$lambda.1se)

> estClass <- predict(ridge.1se, X, type = "class",

s = ridge.cv$lambda.1se)

> Default.recoded.withProbs <- Default.recoded %>%

mutate(probs = c(estProbs), pred = c(estClass))

We can peek at this modified data frame using the command:

> head(Default.recoded.withProbs)

default student balance income probs pred

1 0 0 729.5265 44361.625 0.006038429 0

2 0 1 817.1804 12106.135 0.005738740 0

3 0 0 1073.5492 31767.139 0.020966741 0

4 0 0 529.2506 35704.494 0.002687456 0

5 0 0 785.6559 38463.496 0.007267246 0

6 0 1 919.5885 7491.559 0.008293071 0

Obtaining a confusion matrix and accuracy

A confusion matrix for our ridge logistic regression model fit to training

data is obtained by typing:

> Default.recoded.withProbs %>%

select(default, pred) %>%

table()

pred

default 0 1

0 9658 9

1 284 49

Our training accuracy can be obtained by typing:

> Default.recoded.withProbs %>%

summarize(accuracy = mean(pred == default))

accuracy

1 0.9707

Obtaining lasso parameter estimates as a function of 𝜆

The procedure for performing lasso regression in glmnet is akin to

ridge, except we set the argument alpha (𝛼) to one instead of zero:

> lasso.fit <- glmnet(X, Y, family = "binomial", alpha = 1,

lambda = lambdas)

> plot(lasso.fit, xvar = "lambda")

> legend("bottomright", lwd=1, col=1:3, legend = colnames(X))

Using 𝑘-fold CV to fit a lasso logistic regression model

We next wish to perform 𝑘 = 10-fold cross-validation as a function of the set of 𝜆 values in lambdas.

Specifically, we want to use lasso (𝛼 = 1) to calculate the 10-fold cross-validation error rate for each of the 100 𝜆 values.

To do this, we use the cv.glmnet() function and set the argument

family to "binomial" to indicate logistic regression, the argument

alpha (𝛼) to one for lasso regression, provide the argument lambda (𝜆) our set of tuning parameter values lambdas, and set the nfolds

(𝑘) argument equal to 10.

> lasso.cv <- cv.glmnet(X, Y, family = "binomial", alpha = 1,

lambda = lambdas, nfolds = 10)

Plotting the cross-validation error rate as a function of 𝜆

We can view the cross-validation error rate as a function of 𝜆 with the command:

> plot(lasso.cv)

Recall that numbers at the top show the number of selected features,

which take values of 0,1,…,𝑝.

Estimating the best model (simplest with smallest CV error)

Recall that it is sometimes better to obtain the 𝜆 of the simplest model that is within one standard error of (i.e., not statistically different from)

the best model.

Given our cross-validated object lasso.cv, we can access this 𝜆 using the command lasso.cv$lambda.1se.

We can then find this best-fit, yet simple, lasso logistic regression

model by typing:

> lasso.1se <- glmnet(X, Y, family = "binomial", alpha = 1,

lambda = lasso.cv$lambda.1se)

Obtaining parameter estimates

We can view the parameter estimates using the coef() function:

> coef(ridge.1se)

4 x 1 sparse Matrix of class "dgCMatrix"

s0

(Intercept) -8.128411e+00

balance 3.844218e-03

income 4.968183e-06

student -2.279153e-01

These are close to our logistic regression estimates:

> glm.mult.fit

Call: glm(formula = default ~ balance + income + student, family = "binomial",

data = Default.recoded)

Coefficients:

(Intercept) balance income student

-1.087e+01 5.737e-03 3.033e-06 -6.468e-01

Degrees of Freedom: 9999 Total (i.e. Null); 9996 Residual

Null Deviance: 2921

Residual Deviance: 1572 AIC: 1580

Predicting class probabilities and predictions

Let us also predict class probabilities and labels, and add these

predictions to our original data frame in one step:

> estProbs <- predict(lasso.1se, X, type = "response",

s = lasso.cv$lambda.1se)

> estClass <- predict(lasso.1se, X, type = "class",

s = lasso.cv$lambda.1se)

> Default.recoded.withProbs <- Default.recoded %>%

mutate(probs = c(estProbs), pred = c(estClass))

We can peek at this modified data frame using the command:

> head(Default.recoded.withProbs)

default student balance income probs pred

1 0 0 729.5265 44361.625 0.004059856 0

2 0 1 817.1804 12106.135 0.005832021 0

3 0 0 1073.5492 31767.139 0.016725900 0

4 0 0 529.2506 35704.494 0.001771389 0

5 0 0 785.6559 38463.496 0.005120073 0

6 0 1 919.5885 7491.559 0.008895485 0

Obtaining a confusion matrix and accuracy

A confusion matrix for our lasso logistic regression model fit to

training data is obtained by typing:

> Default.recoded.withProbs %>%

select(default, pred) %>%

table()

pred

default 0 1

0 9651 16

1 273 60

Our training accuracy can be obtained by typing:

> Default.recoded.withProbs %>%

summarize(accuracy = mean(pred == default))

accuracy

1 0.9711

Logistic regression with 𝐾 > 2 classes

Recall that logistic regression models the probability that 𝑌 is a particular class 𝑗 given features 𝑋 = (𝑋1,𝑋2,…,𝑋𝑝) with the logistic

function:

𝑃 𝑌 = 𝑗|𝑋 = 𝑒𝛽0+𝛽1𝑋1+𝛽2𝑋2+⋯+𝛽𝑝𝑋𝑝

1 + 𝑒𝛽0+𝛽1𝑋1+𝛽2𝑋2+⋯+𝛽𝑝𝑋𝑝

Thus far, we have only considered 𝐾 = 2 classes.

Logistic regression with 𝐾 > 2 classes is multinomial regression.

We can implement a penalized version of multinomial regression using the glmnet() function, except we choose the family to be

"multinomial" instead of "binomial".

Review: the mpg data frame

234 observations on 11 features for 38 models of cars:

manufacturer Car manufacturer

model Model name

displ Engine displacement (in liters)

year Year of manufacture

cyl Number of cylinders

trans Type of transmission

drv f=front-wheel, r=rear wheel, 4=four wheel drive

cty City miles per gallon

hwy Highway miles per gallon

fl Fuel type

class Type of car

We will perform penalized logistic (multinomial) regression to predict drv (𝐾 = 3) from 𝑝 = 3 quantitative features (displ, cty, and hwy).

R example of multinomial regression classifier

Let’s first convert drv values of f, r, and 4 to classes 1, 2, and 3:

> mpg.recoded <- mpg %>%

mutate(drv = ifelse(drv == "f", 1,

ifelse(drv == "r", 2, 3)))

> mpg.recoded

# A tibble: 234 x 11

manufacturer model displ year cyl trans drv cty hwy fl class

<chr> <chr> <dbl> <int> <int> <chr> <dbl> <int> <int> <chr> <chr>

1 audi a4 1.8 1999 4 auto(l5) 1 18 29 p compact

2 audi a4 1.8 1999 4 manual(m5) 1 21 29 p compact

3 audi a4 2 2008 4 manual(m6) 1 20 31 p compact

4 audi a4 2 2008 4 auto(av) 1 21 30 p compact

5 audi a4 2.8 1999 6 auto(l5) 1 16 26 p compact

6 audi a4 2.8 1999 6 manual(m5) 1 18 26 p compact

7 audi a4 3.1 2008 6 auto(av) 1 18 27 p compact

8 audi a4 quattro 1.8 1999 4 manual(m5) 3 18 26 p compact

9 audi a4 quattro 1.8 1999 4 auto(l5) 3 16 25 p compact

10 audi a4 quattro 2 2008 4 manual(m6) 3 20 28 p compact

# ... with 224 more rows

Creating the response and input feature matrices

Recall that glmnet requires that response and feature datasets be

provided separately and as matrices instead of data frames.

We will again use the as.matrix() function to convert data frames

to matrices.

Using this function, we obtain the response (𝑌) and feature (𝑋) matrices as:

> Y <- mpg.recoded %>% select(drv) %>% as.matrix()

> X <- mpg.recoded %>% select(displ, cty, hwy) %>% as.matrix()

The response and input feature matrices

We next use the head() function to peek at the first few rows of the

response and feature matrices:

> head(Y)

drv

[1,] 1

[2,] 1

[3,] 1

[4,] 1

[5,] 1

[6,] 1

> head(X)

displ cty hwy

[1,] 1.8 18 29

[2,] 1.8 21 29

[3,] 2.0 20 31

[4,] 2.0 21 30

[5,] 2.8 16 26

[6,] 2.8 18 26

Using penalized multinomial regression to predict drv

We will use penalized multinomial regression to develop a classifier to

predict whether a car uses front-wheel, rear-wheel, or four-wheel

drive based on its engine size and city and highway mileage.

We will fit this penalized multinomial model with a lasso penalty using

the command:

> lasso.fit <- glmnet(X, Y, family = "multinomial",

alpha = 1, lambda = lambdas)

If we wanted to perform ridge regression, then we would have instead set alpha = 0.

Obtaining lasso parameter estimates as a function of 𝜆

> plot(lasso.fit, xvar = "lambda")

> legend("bottomright", lwd=1, col=1:3, legend = colnames(X))

Obtaining lasso parameter estimates as a function of 𝜆

> plot(lasso.fit, xvar = "lambda")

> legend("bottomright", lwd=1, col=1:3, legend = colnames(X))

Using 𝑘-fold CV to fit a lasso logistic regression model

We next wish to perform 𝑘 = 10-fold cross-validation as a function of the set of 𝜆 values in lambdas.

Specifically, we want to use lasso (𝛼 = 1) to calculate the 10-fold cross-validation error rate for each of the 100 𝜆 values.

To do this, we use the cv.glmnet() function and set the argument

family to "multinomial" to indicate multinomial regression, the

argument alpha (𝛼) to one for lasso regression, provide the argument lambda (𝜆) our set of tuning parameter values lambdas, and set the nfolds (𝑘) argument equal to 10:

> lasso.cv <- cv.glmnet(X, Y, family = "multinomial",

alpha = 1, lambda = lambdas,

nfolds = 10)

Plotting the cross-validation error rate as a function of 𝜆

We can view the cross-validation error rate as a function of 𝜆 with the command:

> plot(lasso.cv)

Recall that numbers at the top show the number of selected features,

which take values of 0,1,…,𝑝.

Estimating the best model (simplest with smallest CV error)

Recall that it is sometimes it is better to obtain the 𝜆 of the simplest model that is within one standard error of (i.e., not statistically

different from) the best model.

Given our cross-validated object lasso.cv, we can access this 𝜆 using the command lasso.cv$lambda.1se.

We can then find this best-fit, yet simple, lasso multinomial regression

model by typing:

> lasso.1se <- glmnet(X, Y, family = "multinomial",

alpha = 1,

lambda = lasso.cv$lambda.1se)

Obtaining parameter estimates

We can view the parameter estimates using the coef() function:

> coef(lasso.1se) $`1`

4 x 1 sparse Matrix of class "dgCMatrix"

s0

-2.2069847

displ .

cty .

hwy 0.1255963

$`2`

4 x 1 sparse Matrix of class "dgCMatrix"

s0

-4.463812

displ 1.081043

cty .

hwy .

$`3`

4 x 1 sparse Matrix of class "dgCMatrix"

s0

6.670797

displ .

cty .

hwy -0.249793

Estimating class probabilities

We may wish to know the probabilities of each of the 𝐾 = 3 classes, which we can estimate as:

> estProbs <- predict(lasso.1se, X, type = "response",

s = lasso.cv$lambda.1se)

We can peek at this set of estimated probabilities with the command:

> as_tibble(estProbs)

[1] # A tibble: 234 x 3

`1.1` `2.1` `3.1`

<dbl> <dbl> <dbl>

1 0.867 0.0166 0.116

2 0.867 0.0166 0.116

3 0.924 0.0171 0.0585

4 0.898 0.0189 0.0828

5 0.668 0.0551 0.277

6 0.668 0.0551 0.277

7 0.722 0.0726 0.205

8 0.694 0.0194 0.287

9 0.612 0.0194 0.369

10 0.818 0.0221 0.160

# ... with 224 more rows

Predicting classes and adding them to our data frame

Next, we predict classes and add them to our original data frame:

> estClass <- predict(lasso.1se, X, type = "class",

s = lasso.cv$lambda.1se)

> mpg.recoded.withPreds <- mpg.recoded %>%

mutate(pred = c(estClass))

We can peek at this modified data frame with the command:

> mpg.recoded.withPreds

# A tibble: 234 x 12

manufacturer model displ year cyl trans drv cty hwy fl class pred

<chr> <chr> <dbl> <int> <int> <chr> <dbl> <int> <int> <chr> <chr> <chr>

1 audi a4 1.8 1999 4 auto(l5) 1 18 29 p compact 1

2 audi a4 1.8 1999 4 manual(m5) 1 21 29 p compact 1

3 audi a4 2 2008 4 manual(m6) 1 20 31 p compact 1

4 audi a4 2 2008 4 auto(av) 1 21 30 p compact 1

5 audi a4 2.8 1999 6 auto(l5) 1 16 26 p compact 1

6 audi a4 2.8 1999 6 manual(m5) 1 18 26 p compact 1

7 audi a4 3.1 2008 6 auto(av) 1 18 27 p compact 1

8 audi a4 quattro 1.8 1999 4 manual(m5) 3 18 26 p compact 1

9 audi a4 quattro 1.8 1999 4 auto(l5) 3 16 25 p compact 1

10 audi a4 quattro 2 2008 4 manual(m6) 3 20 28 p compact 1

# ... with 224 more rows

Obtaining a confusion matrix and accuracy

A confusion matrix for our lasso multinomial regression model fit to

training data is obtained by typing:

> mpg.recoded.withPreds %>%

select(drv, pred) %>%

table()

pred

drv 1 2 3

1 97 1 8

2 4 5 16

3 23 1 79

Our training accuracy can be obtained by typing:

> mpg.recoded.withPreds %>%

summarize(accuracy = mean(pred == drv))

# A tibble: 1 x 1

accuracy

<dbl>

1 0.774

Summary of this lecture

We assess test accuracy of models with cross-validation, which involves

randomly and repeatedly splitting the data into training and test sets,

and then computing mean performance across iterations.

There are two major types of cross-validation: 𝑘-fold and LOOCV.

In feature selection, we try to minimize test error while also minimizing

the number of features used to make predictions.

We discussed three important algorithms for “traditional” feature

selection: best subset, forward, and background selection.

We also discussed three regularization approaches that “shrink”

parameter estimates toward zero: ridge, lasso, and elastic net.