regression using R
Review: training and test error
We can evaluate the prediction error of our model either on the
training dataset (known as training error) or on a dataset not used
for training the model (known as test error).
When training (fitting) a supervised learning model, the algorithm’s
goal is to minimize training error.
However, what we really want to do is to minimize the test error, or
the prediction error on data we have not observed yet.
Overfitting is an issue, because we can begin modeling noise in the
training data rather than only the underlying function.
This can lead to low training error, but high test error.
Review: illustration of training vs. test error
Below is prediction error as a function of polynomial degree 𝑀 for both training and test datasets:
Training error always decreases with increasing model complexity
(flexibility), whereas test error has a U-shaped curve.
Bishop (2006) Pattern Recognition and Machine Learning. Springer.
Prediction
error
How do we
identify 𝑴 such that the inferred
model has good
test error?
How do we minimize test error?
We have two main goals:
1) Model selection – comparing the performance of different
models to select the best one
2) Model assessment – estimating test error of the selected model
If we have a lot of data, then the optimal approach for both of these is
to divide the dataset uniformly at random into three parts:
Training set
Validation set
Test set
Training, validation, and test sets
Training set: used for fitting the models
Validation set: used to estimate prediction error for model selection
Test set: used for assessing prediction (test) error on new data
The test set should only be used at the end of the analysis for model
assessment, and not for choosing the best model.
There is no general rule for how to split the dataset, but a typical split
is 50% for training, 25% for validation, and 25% for test.
Hastie et al (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd Ed.
Do we need to split data into three parts?
For many datasets it may not be possible.
That is, there may not be enough data to create a separate test
dataset without decreasing the sample size such that it is deleterious
to model selection.
Model selection is critical, and often our primary concern is training a
method that will perform as well as possible on unseen data.
We will therefore consider methods that split the dataset into only
training and validation sets.
In these methods, the validation set is used for estimating test error.
The validation (hold-out) set approach
The validation (or hold-out) set approach involves dividing the training
dataset uniformly at random into roughly two equally-sized parts: the
training set and the validation set (known as the hold-out set).
The model is rained on the traning se
The model is trained on the training set (blue), and the test error of
the trained model is estimated from the validation set (orange).
Models of different complexity can be trained and their test error
evaluated, and then the best-performing model can be selected.
James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
Validation set approach on polynomial regression
Validation (test) error for a single split into training and validation sets:
James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
Validation set approach on polynomial regression
Validation (test) error for different splits of the data:
Test error is highly variable across splits, and depends on which
observations are included in the training vs. validation sets.
This variation occurs because the validation set approach trains the
model on only half of the data (small sample).
Methods tend to perform worse when trained on fewer observations,
suggesting validation error overestimates test error on entire dataset. James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
Overcoming pitfalls of the validation set approach
We need the learned model to be more robust to variability in the
choice of training and validation set.
We also want the trained model to be based on a large sample size.
The solution to this problem is cross-validation (CV), which involves
repeatedly splitting the data into training and test sets, and then
assessing mean performance across iterations.
There are tow major CV approaches:
1) Leave-one-out cross-validation (LOOCV)
2) 𝑘-fold cross validation
Leave-one-out cross-validation (LOOCV)
James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
Leave-one-out cross-validation (𝐋𝐎𝐎𝐂𝐕) splits the data into a training set of 𝑛 − 1 observations and a validation set of the remaining 1 observation:
Leave-one-out cross-validation (LOOCV)
We start by removing training observation (𝑥1,𝑦1), and using only observations {(𝑥2,𝑦2),(𝑥3,𝑦3),…,(𝑥𝑛,𝑦𝑛)} for training.
The model is fit to the training data, and a prediction is made for
training observation (𝑥1,𝑦1).
This validation (test) error of training observation (𝑥1,𝑦1) is:
MSE1 = 𝑦1 − ො𝑦1 2
This estimate of the MSE of the method will be highly variable
because it is based on only a single observation.
Leave-one-out cross-validation (LOOCV)
We repeat this process for each of the 𝑛 training observations, such that we remove training observation (𝑥𝑖,𝑦𝑖), and use observations { 𝑥1,𝑦1 ,…, 𝑥𝑖−1,𝑦𝑖−1 , (𝑥𝑖+1,𝑦𝑖+1),…,(𝑥𝑛,𝑦𝑛)} for training.
The model is fit to the training data, and a prediction is made for
training observation (𝑥𝑖,𝑦𝑖).
The validation (test) error of training observation (𝑥𝑖,𝑦𝑖) is:
MSE𝑖 = 𝑦𝑖 − ො𝑦𝑖 2
When we have computed MSE for all 𝑛 training observations, we compute the LOOCV validation (test) error as:
CV(𝑛) = MSE1 + MSE2 + ⋯+ MSE𝑛
𝑛
Advantages of LOOCV over validation set
LOOCV has substantially less bias because the model is repeatedly
trained on a large sample set of size 𝑛 − 1 training observations, whereas the validation set approach trains models with a sample size
of approximately 𝑛/2 training observations.
Because of this, LOOCV tends to not overestimate the test error.
Moreover, LOOCV always yields the same test error and inferred
model, compared to variation attributed to different splits of the data
with the validation set approach.
However, LOOCV is computationally expensive, because the model
needs to be trained 𝑛 times.
LOOCV approach on polynomial regression
James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
𝑘-fold CV
James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
𝒌-fold cross-validation (𝒌−fold 𝐂𝐕) randomly splits data into 𝑘 “folds”, with a training set of 𝑘 − 1 folds and a test set of the remaining 1 fold.
𝑘-fold cross-validation (CV)
This process is repeated for each set (just like with LOOCV):
When set 𝑖 is held out for validation, the remaining 𝑘 − 1 sets (sets 1,…,𝑖 − 1,𝑖 + 1,…,𝑘) are used for training.
The MSE across the observations in the 𝑖th set is used to measure the validation (test) error, and is denoted by MSE𝑖, and the overall 𝑘- fold CV test error is computed as
CV(𝑘) = MSE1 + MSE2 + ⋯+ MSE𝑘
𝑘
Computational savings with 𝑘-fold CV
LOOCV is a special case of 𝑘-fold CV with 𝑘 = 𝑛.
Though LOOCV is superior to 𝑘-fold CV (𝑘 < 𝑛), it is also computationally expensive to train the model 𝑛 times.
In practice, we usually use 𝑘 = 5 or 𝑘 = 10 for 𝑘-fold CV.
This makes 𝑘-fold CV (𝑘 < 𝑛) much less computationally intensive than LOOCV.
In such cases, we only have to train 5 or 10 models, rather than 𝑛, which could lead to enormous savings for methods that are
computationally intensive to fit.
LOOCV vs. 10-fold CV on polynomial regression
James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
The overall performance is similar between LOOCV and 10-fold CV:
Though there is some variability across the 10-fold CV estimates, it is much lower than the variability from the validation set approach.
Non-computational advantages of 𝑘-fold CV (𝑘 < 𝑛)
Because LOOCV uses a greater number of observations in each
training set, its model estimates will be based on a larger sample
size, and should exhibit smaller bias relative to 𝑘-fold CV (𝑘 < 𝑛).
However, LOOCV has a higher variance than 𝑘-fold CV (𝑘 < 𝑛).
LOOCV is essentially averaging the output of 𝑛 fitted models, each of which is highly positively correlated with the others (as almost all
training observations overlap).
𝑘-fold CV is averaging the output of 𝑘 fitted models, each of which is partially correlated with the others.
Non-computational advantages of 𝑘-fold CV (𝑘 < 𝑛)
The mean of many highly correlated quantities has higher variance
than the mean of many quantities that are less correlated.
Because of this, the test error estimates resulting from 𝑘-fold CV (𝑘 < 𝑁) tend to be smaller than those from LOOCV.
Therefore, there is a bias-variance trade-off depending on the chosen
value of 𝑘 in 𝑘-fold CV.
For these reasons, 5- and 10-fold CV is often performed, as they
have been demonstrated empirically to provided estimates of the test
error that do not experience high bias or high variance.
Feature selection
A related problem to model selection is feature selection, which
involves determining which features are associated with the response
so that we can fit a model with only those features.
In feature selection, we seek to minimize test error, while also
minimizing the number of features used to make predictions.
After feature selection, chosen features would hopefully be the
important ones for making predictions about the response.
Irrelevant features not associated with the response are discarded.
Redundant (correlated) features are also discarded.
Why perform feature selection?
Feature selection can enhance interpretability of the fitted model
because fewer features are associated with the response.
The reduction of model parameters may lead to better generalization
or test error due to a reduction in overfitting.
Including fewer features can reduce the time it takes to train models,
as training time increases with increased dimensionality (more
complexity/ features).
Subset selection methods
We begin by considering subset selection methods.
These methods try to identify a subset of the 𝑝 features that will lead to reduced prediction error.
We will consider three approaches:
1) Best subset selection – compare all possible subsets of features
2) Forward selection – start with null model (no features) and add
3) Backward selection – start with full model (all features) and subtract
Best subset selection
1. Let ℳ0 denote the null model that contains no features, which is the model with only parameter 𝛽0.
2. For each 𝑘 = 1,2,…,𝑝:
(a) Fit all models of exactly 𝑘 features.
(b) Pick the best (smallest RSS) among these models, and call it
ℳ𝑘.
3. Select a single best model from {ℳ0,ℳ1,…ℳ𝑝} based on model
selection (e.g., cross-validation).
Adapted from James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
This is the ideal approach because we are comparing every
possible subset of features!
Illustration of best subset selection
RSS for dataset with 𝑝 = 8 features:
Choose model with lowest RSS among all sets of models.
The best model contains all 𝑝 = 8 features in the training set.
Hastie et al (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd Ed.
Problem with best subset selection
In best subset selection, we must perform least squares regression
for all subsets with 0,1,2,…,𝑝 features.
The set of all subsets is known as the power set, which is
exponentially increasing in size of the original set.
Because we have 𝑝 features, the total number of fitted models is 2𝑝, which is enormous, and computationally infeasible for anything but a
small number of features (maximum 𝑝 between 30 and 40 based on an efficient algorithm of Furnival and Wilson 1974).
For example, even when 𝑝 = 20 (small number of features for modern datasets), we must consider 220 = 1,048,576 models!
Furnival, Wilson (1974) Technometrics 16:499-511.
How can we overcome this computational burden?
Best subset selection is simply not computationally feasible for many
contemporary datasets.
The other two approaches that we will discuss examine a far more
restricted space compared to best subset selection:
2) Forward selection – start with null model (no features) and add
3) Backward selection – start with full model (all features) and subtract
Forward selection
1. Let ℳ0 denote the null model that contains no features, which is the model with only parameter 𝛽0.
2. For each 𝑘 = 0,1,…,𝑝 − 1:
(a) Fit all 𝑝 − 𝑘 models that augment the features in ℳ𝑘 with one additional feature.
(b) Pick the best (smallest RSS) among these 𝑝 − 𝑘 models, and call it ℳ𝑘+1.
3. Select a single best model from {ℳ0,ℳ1,…ℳ𝑝} based on model
selection (e.g., cross validation).
Adapted from James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
Backward selection
1. Let ℳ𝑝 denote the full model that contains all 𝑝 features.
2. For each 𝑘 = 𝑝,𝑝 − 1,…,1:
(a) Fit all 𝑘 models that have all but one of the features in ℳ𝑘, for a total of 𝑘 − 1 features.
(b) Pick the best (smallest RSS) among these 𝑘 models, and call it ℳ𝑘−1.
3. Select a single best model from {ℳ0,ℳ1,…ℳ𝑝} based on model
selection (e.g., cross validation).
Adapted from James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
Number of fitted models with forward & backward selection
The number of fitted models for either forward selection or backward
selection is:
1 + 𝑝 𝑝 + 1
2
Thus, they are equivalent in computational efficiency.
This number is also much smaller than the 2𝑝 models required by best subset selection.
For example, with 𝑝 = 20, best subset selection requires 1,048,576 fitted models, while forward and backward selection require only 211.
How to choose between forward & backward selection?
First, it is important to point out that both methods may not yield the
optimal solution, which can only be found by best subset selection.
The optimal solutions for forward and backward stepwise selection
also may differ.
Backward subset selection requires 𝑝 ≤ 𝑛, so that the full model can be fit without issue.
Forward subset selection can be used even when 𝑝 > 𝑛, as the algorithm can just stop at the subset of size 𝑘 = 𝑛 features.
However, forward selection is “greedy” and might include features
early that later become less important.
Can we get closer to best subset selection?
A hybrid approach between forward and backward stepwise selection
can be taken to get a better approximation to best subset selection.
With such an approach, features are added iteratively through
forward stepwise selection.
After each new feature is added, the method can remove any feature
from the model that no longer improves model fit.
Therefore, features added earlier in the process may be removed if
they are no longer important conditional on the set of added features
that were added later in the process.
This alleviates the “greedy” issue of forward selection.
Comparison of subset selection approaches
𝑁 = 30 observations with 𝑝 = 31 standard normal features with pairwise correlations of 0.85:
Adapted from Hastie et al (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd Ed.
Shrinkage methods as alternative to subset selection
Subset selection methods retain a subset of the features and discard
the remainder, often leading to a more interpretable model.
However, this discrete process of retaining or discarding features
often leads to high variance, and therefore does not reduce the
prediction error as much as hoped from the full model.
An alternative approach is to fit a model with all features, then shrink
estimated parameters toward zero relative to least squares estimates.
This can reduce variance of estimates, as well as potentially remove
(estimate parameters to be zero) unimportant features.
This approach is called regularization because it constrains, or
regularizes, parameter estimates.
Regularization techniques
Recall that in least squares, we estimate parameters 𝛽0 …𝛽𝑝 by
minimizing the cost function 𝐽 𝛽 :
𝐽 𝛽 = RSS =
𝑖=1
𝑛
(𝑦𝑖 − ො𝑦𝑖) 2
Regularization shrinks estimates by minimizing a penalized variation.
We will discuss three common techniques:
1) Ridge regression
2) Lasso
3) Elastic net
Regularization techniques
Recall that in least squares, we estimate parameters 𝛽0 …𝛽𝑝 by
minimizing the cost function 𝐽 𝛽 :
𝐽 𝛽 = RSS =
𝑖=1
𝑛
(𝑦𝑖 − ො𝑦𝑖) 2
Regularization shrinks estimates by minimizing a penalized variation.
We will discuss three common techniques:
1) Ridge regression
2) Lasso
3) Elastic net
Ridge regression
Recall that in least squares, we estimate parameters 𝛽0 …𝛽𝑝 by
minimizing the cost function 𝐽 𝛽 :
𝐽 𝛽 = RSS =
𝑖=1
𝑛
(𝑦𝑖 − ො𝑦𝑖) 2
Ridge regression instead minimizes:
𝐽 𝛽,𝜆 = RSS + 𝜆
𝑗=1
𝑝
𝛽𝑗 2 ,
where the second term is a shrinkage (L2) penalty, and 𝜆 ≥ 0 is a tuning parameter that determines its influence on minimization.
How ridge regression works
Ridge regression minimizes:
𝐽 𝛽,𝜆 = RSS + 𝜆
𝑗=1
𝑝
𝛽𝑗 2 ,
where the second term is a shrinkage (L2) penalty, and 𝜆 ≥ 0 is a tuning parameter that determines its influence on minimization.
The shrinkage penalty is smallest when 𝛽1 …𝛽𝑝 are close to zero, thus
shrinking their estimates close to zero.
The shrinkage penalty has no influence when 𝜆 = 0 (least squares regression), and increasing influence as 𝜆 → ∞.
Choosing the tuning parameter 𝜆.
The original formulation of least squares regression minimizes the cost
function 𝐽(𝛽) = RSS, leading to a single set of parameter estimates.
However, ridge regression leads to different sets of estimates across
values of the tuning parameter 𝜆.
We choose the best set of parameter estimates under ridge regression
through cross-validation.
That is, we identify መ𝜆 through cross-validation that leads to the best መ𝛽.
We then estimate parameters based on this tuning parameter መ𝜆.
Example behavior of parameter estimates with increasing 𝜆
Parameter estimates (coefficients) for predicting credit card balance
from income, credit limit, credit rating, and student status as a
function of 𝜆 :
James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
Tuning parameter creates bias-variance tradeoff
The tuning parameter 𝜆 modifies model complexity.
That is, as the tuning parameter increases, model complexity
decreases, and we have fewer effective parameters.
This leads to a greater bias but a decreased variance.
The optimal 𝜆 value is chosen such that there is an optimal tradeoff between bias and variance, yielding improved test error.
Choosing the optimal value of the tuning parameter 𝜆
Depicted are MSEs (purple), squared bias (black), variance (green),
and irreducible error (dotted black) for different values of 𝜆:
James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
Optimal MSE (purple X)
occurs at intermediate
value of 𝜆
Advantages of ridge regression over other approaches
Least squares estimates have increasing variance as 𝑝 → 𝑛, and do not even have unique solutions when 𝑝 > 𝑛.
Ridge regression can still perform well when 𝑝 > 𝑛, trading small increases in bias for potentially larger decreases in variance.
Moreover, we must fit 2𝑝 models for best subset selection, and 1 + 𝑝(𝑝 + 1)/2 models for forward and backward selection.
In contrast, ridge regression only requires us to fit a single model for
any value of 𝜆, which can be done quickly.
But there is one obvious disadvantage to ridge regression relative to
all subset selection methods—it includes all 𝑝 features in the model.
Regularization techniques
Recall that in least squares, we estimate parameters 𝛽0 …𝛽𝑝 by
minimizing the cost function 𝐽 𝛽 :
𝐽 𝛽 = RSS =
𝑖=1
𝑛
(𝑦𝑖 − ො𝑦𝑖) 2
Regularization shrinks estimates by minimizing a penalized variation.
We will discuss three common techniques:
1) Ridge regression
2) Lasso
3) Elastic net
Lasso is very similar in form to ridge regression
Ridge regression minimizes:
𝐽 𝛽,𝜆 = RSS + 𝜆
𝑗=1
𝑝
𝛽𝑗 2 ,
where the second term is a shrinkage (L2) penalty, and 𝜆 ≥ 0 is a tuning parameter that determines its influence on minimization.
Lasso instead minimizes:
𝐽 𝛽,𝜆 = RSS + 𝜆
𝑗=1
𝑝
𝛽𝑗 ,
where the second term is a shrinkage (L1) penalty, and 𝜆 ≥ 0 is a tuning parameter that determines its influence on minimization.
Lasso also works similarly to ridge regression
Lasso minimizes:
𝐽 𝛽,𝜆 = RSS + 𝜆
𝑗=1
𝑝
𝛽𝑗 ,
where the second term is a shrinkage (L1) penalty, and 𝜆 ≥ 0 is a tuning parameter that is selected by cross-validation.
As with ridge regression, the shrinkage penalty is smallest when 𝛽1 …𝛽𝑝 are close to zero, and has increasing influence as 𝜆 → ∞.
But unlike ridge regression, lasso can perform feature selection by
forcing some estimates to be exactly zero when is 𝜆 sufficiently large.
Tuning parameter creates bias-variance tradeoff
The tuning parameter 𝜆 modifies model complexity.
That is, as the tuning parameter increases, model complexity
decreases, and we have fewer effective parameters.
This leads to a greater bias but a decreased variance.
In addition, the decreased variance is associated with some features
becoming discarded as the tuning parameter increases in value.
The optimal 𝜆 value is chosen such that there is an optimal tradeoff between bias and variance, yielding improved test error.
Depicted are MSEs (purple), squared bias (black), and variance
(green) for different values of 𝜆:
Tuning parameter makes bias-variance tradeoff
James et al (2017) An Introduction to Statistical Learning: with Applications in R. Springer.
Optimal MSE (purple X)
again occurs at
intermediate value of 𝜆
Ridge regression vs. lasso: Which is better?
Some of the same advantages of ridge regression also apply to lasso
(perform well when 𝑝 > 𝑛 unlike least squares, only need to fit a single model for each 𝜆 unlike subset selection methods).
Though ridge regression and lasso may outcompete these other
methods, neither one is always better than the other.
Ridge regression will generally perform better when the response is a
function of many features (because it must include all features).
Lasso will generally perform better when the response is a function of
few features (because it can perform feature selection).
However, we do not typically know how many features are related to the
response a priori for real data.
Comparison of approaches for prostate cancer prediction
Parameter estimates and test errors for least squares (LS), best
subset selection, ridge regression, and lasso:
Adapted from Hastie et al (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd Ed.
Comparison of approaches for prostate cancer prediction
Parameter estimates and test errors for least squares (LS), best
subset selection, ridge regression, and lasso:
Which approach is best? What does this say about the data?
Adapted from Hastie et al (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd Ed.
?
What if the best solution lies between ridge and lasso?
The best solution may not be the one determined by ridge regression,
or by lasso, but rather may be somewhere in between the two.
In such cases, combining ridge regression and lasso may increase
predictive ability.
This solution, termed elastic net, was proposed such that both ridge
regression and lasso are nested models within it.
Regularization techniques
Recall that in least squares, we estimate parameters 𝛽0 …𝛽𝑝 by
minimizing the cost function 𝐽 𝛽 :
𝐽 𝛽 = RSS =
𝑖=1
𝑛
(𝑦𝑖 − ො𝑦𝑖) 2
Regularization shrinks estimates by minimizing a penalized variation.
We will discuss three common techniques:
1) Ridge regression
2) Lasso
3) Elastic net
Elastic net combines ridge regression and lasso
Ridge regression minimizes:
𝐽 𝛽,𝜆 = RSS + 𝜆
𝑗=1
𝑝
𝛽𝑗 2
Lasso minimizes:
𝐽 𝛽,𝜆 = RSS + 𝜆
𝑗=1
𝑝
𝛽𝑗
Elastic net minimizes:
𝐽 𝛽,𝜆,𝛼 = RSS + 𝜆 1 − 𝛼
𝑗=1
𝑝
𝛽𝑗 2 + 𝛼
𝑗=1
𝑝
𝛽𝑗 ,
where the second term is a shrinkage penalty, 𝜆 ≥ 0 is a tuning parameter that determines its influence on minimization, and 0 ≤ 𝛼 ≤ 1 is another tuning parameter that determines the influence of ridge
regression and lasso on the estimated parameters.
Elastic net works similarly to ridge regression and lasso
Elastic net minimizes:
𝐽 𝛽,𝜆,𝛼 = RSS + 𝜆 1 − 𝛼
𝑗=1
𝑝
𝛽𝑗 2 + 𝛼
𝑗=1
𝑝
𝛽𝑗 ,
where the second term is a shrinkage penalty, and 𝜆 ≥ 0 and 0 ≤ 𝛼 ≤ 1 are tuning parameters that are selected by cross-validation.
As with ridge regression and lasso, the shrinkage penalty is smallest
when 𝛽1 …𝛽𝑝 are close to zero, and has increasing influence as 𝜆 → ∞.
The tuning parameter 𝛼 determines the influence of ridge regression and lasso on the estimated parameters.
We obtain ridge regression when 𝛼 = 0, and lasso when 𝛼 = 1.
Comparison of approaches for prostate cancer prediction
Parameter predictions and test errors for ordinary least squares
(OLS), ridge regression, lasso, naïve elastic net, and elastic net
(improved performance, citation below):
Note that 𝑠 is the fraction of the L1 norm 𝛽𝑗 , which is in the L1
penalty term in lasso.
Zou and Hastie (2005) J. R. Statist. Soc. B. 67:301-320.
Cross-validated ridge and lasso regression in R
R of course has the machinery for us to easily perform ridge and
lasso regression together with cross-validation to determine the
appropriate tuning parameter 𝜆 (i.e., model complexity).
We first need to install and load the library for the glmnet package,
which provides an implementation of elastic net (and therefore lasso
and ridge) regression together with cross-validation.
To install and load glmnet, use the command:
> install.packages("glmnet")
> library(glmnet)
We will apply these methods to the Advertising dataset from our
lecture on linear regression.
Review: Advertising dataset
We will next consider the Advertising dataset, which can be
downloaded from https://www.statlearning.com/resources-first-edition.
This dataset is tidied into a data frame with 𝑛 = 200 market observations (rows) on 𝑝 = 4 features (columns):
sales Sales (thousands of units)
TV TV advertising budget (thousands of dollars)
radio Radio advertising budget (thousands of dollars)
newspaper Newspaper advertising budget (thousands of dollars)
Our goal is to predict sales from the other three features.
Creating the response and input feature matrices
The glmnet package requires that response and feature datasets be
provided separately and as matrices instead of data frames.
We can easily convert a data frame to a matrix with the as.matrix() function in R.
Using this function, we can obtain the response (𝑌) and feature (𝑋) matrices for use in the glmnet package:
> Y <- Advertising %>% select(sales) %>% as.matrix()
> X <- Advertising %>% select(TV, radio, newspaper) %>%
as.matrix()
The response and input feature matrices
We can use the head() function to peek at the first few rows of the
response and feature matrices:
> head(Y)
sales
[1,] 22.1
[2,] 10.4
[3,] 9.3
[4,] 18.5
[5,] 12.9
[6,] 7.2
> head(X)
TV radio newspaper
[1,] 230.1 37.8 69.2
[2,] 44.5 39.3 45.1
[3,] 17.2 45.9 69.3
[4,] 151.5 41.3 58.5
[5,] 180.8 10.8 58.4
[6,] 8.7 48.9 75.0
Specifying the set of tuning parameter (𝜆) values
Our goal is to identify the “best” 𝜆 and corresponding model through cross-validation.
Thus, we need to ensure that we explore a range of 𝜆 values that is wide enough to capture all important levels of model complexity.
We will consider 100 𝜆 values that span a range from 0.001 to 1000 on a logarithmic scale.
To do this, we will use the seq() function together with some
transformations:
> lambdas <- 10^seq(-3, 3, length.out = 100)
Obtaining ridge parameter estimates as a function of 𝜆
We can examine the set of all ridge regression parameter estimates as a function of the 𝜆 values in lambdas.
Specifically, we want to use ridge regression (𝛼 = 0) to estimate the four model parameters corresponding to the intercept and coefficients for TV, radio, and newspaper for each of the 100 𝜆 values.
To do this, we use the glmnet() function and set the argument
alpha (𝛼) to zero and provide the argument lambda (𝜆) our set of tuning parameter values lambdas:
> ridge.fit <- glmnet(X, Y, alpha = 0, lambda = lambdas)
Plotting parameter estimates as a function of 𝜆
The package glmnet comes with its own useful plotting functions.
We can view how parameter estimates shrink as 𝜆 increases:
> plot(ridge.fit, xvar = "lambda")
> legend("topright", lwd=1, col = 1:3, legend = colnames(X))
Using 𝑘-fold CV to fit a ridge regression model
We next wish to perform 𝑘 = 10-fold cross-validation as a function of the set of 𝜆 values in lambdas.
Specifically, we want to use ridge regression (𝛼 = 0) to calculate the 10-fold cross-validation error rate for each of the 100 𝜆 values.
To do this, we use the cv.glmnet() function and set the argument
alpha (𝛼) to zero, provide the argument lambda (𝜆) our set of tuning parameter values lambdas, and set the nfolds (𝑘) argument to 10:
> ridge.cv <- cv.glmnet(X, Y, alpha = 0, lambda = lambdas,
nfolds = 10)
Plotting the cross-validation error rate as a function of 𝜆
We can view the cross-validation error rate for ridge regression as a
function of 𝜆 with the command:
> plot(ridge.cv)
Numbers at the top show the number of selected features, which take
values of 0,1,…,𝑝 (ridge regression always retains all 𝑝 features).
Estimating the best model (smallest CV error rate)
We now would like to find the model with the smallest CV error rate.
To do so, we must first select the 𝜆 with the minimum CV error rate across the range of values that we explored.
Given our cross-validated object ridge.cv, we can access the 𝜆
with the minimum CV error rate using the command ridge.cv$lambda.min.
We can then find the best-fit ridge regression model by typing:
> ridge.best <- glmnet(X, Y, alpha = 0,
lambda = ridge.cv$lambda.min)
Obtaining parameter estimates
We can view the parameter estimates using the coef() function:
> coef(ridge.best)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 2.9906421059
TV 0.0455213515
radio 0.1874124145
newspaper -0.0007094929
This is almost identical to the least squares regression estimates:
> lm.fit
Call:
lm(formula = sales ~ TV + radio + newspaper, data = Advertising)
Coefficients:
(Intercept) TV radio newspaper
2.938889 0.045765 0.188530 -0.001037
Making predictions with the fitted model
Given our fitted model ridge.best, we can now make predictions
from new data with the predict() function in glmnet.
We can make predictions from the best-fit ridge regression model as:
> predict(ridge.best, as.matrix(tibble(TV = 1000,
radio = 2000, newspaper = 3000)))
s0
[1,] 421.2083
This is almost identical to our prediction with least squares:
> predict(lm.fit, tibble(TV = 1000, radio = 2000,
newspaper = 3000))
1
422.6511
Obtaining lasso parameter estimates as a function of 𝜆
The procedure for performing lasso regression in glmnet is akin to
ridge, except we set the argument alpha (𝛼) to one instead of zero:
> lasso.fit <- glmnet(X, Y, alpha = 1, lambda = lambdas)
> plot(lasso.fit, xvar = "lambda")
> legend("topright", lwd=1, col = 1:3, legend = colnames(X))
Using 𝑘-fold CV to fit a lasso regression model
We next wish to perform 𝑘 = 10-fold cross-validation as a function of the set of 𝜆 values in lambdas.
Specifically, we want to use lasso (𝛼 = 1) to calculate the 10-fold cross-validation error rate for each of the 100 𝜆 values.
To do this, we use the cv.glmnet() function and set the argument
alpha (𝛼) to one, provide the argument lambda (𝜆) our set of tuning parameter values lambdas, and set the nfolds (𝑘) argument to 10:
> lasso.cv <- cv.glmnet(X, Y, alpha = 1, lambda = lambdas,
nfolds = 10)
Plotting the cross-validation error rate as a function of 𝜆
We can view the cross-validation error rate of lasso as a function of 𝜆 with the command:
> plot(lasso.cv)
Recall that numbers at the top show the number of selected features,
which take values of 0,1,…,𝑝.
Estimating the best model (smallest CV error rate)
We now would like to find the model with the smallest CV error rate.
To do so, we must first select the 𝜆 with the minimum CV error rate across the range of values that we explored.
Given our cross-validated object ridge.cv, we can access the 𝜆
with the minimum CV error rate using the command ridge.cv$lambda.min.
We can then find the best-fit lasso regression model by typing:
> lasso.best <- glmnet(X, Y, alpha = 1,
lambda = lasso.cv$lambda.min)
Obtaining parameter estimates
We can view the parameter estimates using the coef() function:
> coef(lasso.best)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 3.07623499
TV 0.04520385
radio 0.18480819
newspaper .
These are similar to ridge regression with newspaper removed:
> coef(ridge.best)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 2.9906421059
TV 0.0455213515
radio 0.1874124145
newspaper -0.0007094929
Making predictions with the fitted model
Given our fitted model lasso.best, we can now make predictions
from new data with the predict() function in glmnet.
We can make predictions from the best-fit lasso regression model as:
> predict(lasso.best, as.matrix(tibble(TV = 1000,
radio = 2000, newspaper = 3000)))
s0
[1,] 417.8965
This is similar to our prediction from the best-fit ridge regression model:
> predict(ridge.best, as.matrix(tibble(TV = 1000,
radio = 2000, newspaper = 3000)))
s0
[1,] 421.2083
Application of regularization to classification problems
We can also apply penalization to logistic regression!
The ridge-penalized logistic regression cost function is:
𝐽 𝛽,𝜆 = −log𝐿(𝛽) + 𝜆
𝑗=1
𝑝
𝛽𝑗 2
The lasso-penalized logistic regression cost function is:
𝐽 𝛽,𝜆 = −log𝐿(𝛽) + 𝜆
𝑗=1
𝑝
𝛽𝑗
The elastic net-penalized logistic regression cost function is:
𝐽 𝛽,𝜆,𝛼 = −log𝐿(𝛽) + 𝜆 1 − 𝛼
𝑗=1
𝑝
𝛽𝑗 2 + 𝛼
𝑗=1
𝑝
𝛽𝑗
Cross-validated ridge and lasso logistic regression in R
Recall that R has the machinery for us to easily perform ridge and
lasso logistic regression together with cross-validation to determine
the appropriate tuning parameter 𝜆 (i.e., model complexity) using the elastic net model in the glmnet package.
To load glmnet, use the command:
> library(glmnet)
We will apply these methods to the Default dataset from our lecture
on classification with logistic regression.
Review: the Default dataset
The Default dataset is located in the ISLR package.
Simulated observations from 10,000 customers on four features:
default (Yes = defaulted, No = did not default)
student (Yes = student, No = not student)
balance (average balance after monthly payment)
income (annual income)
Our goal is to predict default from the other three features.
Cross-validated ridge and lasso logistic regression in R
Recall that to perform logistic regression, we coded the two default
classes (Yes and No) as 1 and 0:
> Default.recoded <- Default %>%
mutate(default = ifelse(default == "Yes", 1, 0),
student = ifelse(student == "Yes", 1, 0))
We will use these recoded data again for consistency, so that we can compare our results here to those obtained earlier.
Creating the response and input feature matrices
Recall that glmnet requires that response and feature datasets be
provided separately and as matrices instead of data frames.
We can easily convert a data frame to a matrix with the as.matrix() function in R.
Using this function, we can obtain the response (𝑌) and feature (𝑋) matrices for use in the glmnet package:
> Y <- Default.recoded %>% select(default) %>% as.matrix()
> X <- Default.recoded %>%
select(balance, income, student) %>%
as.matrix()
The response and input feature matrices
We can use the head() function to peek at the first few rows of the
response and feature matrices:
> head(Y)
default
[1,] 0
[2,] 0
[3,] 0
[4,] 0
[5,] 0
[6,] 0
> head(X)
balance income student
[1,] 729.5265 44361.625 0
[2,] 817.1804 12106.135 1
[3,] 1073.5492 31767.139 0
[4,] 529.2506 35704.494 0
[5,] 785.6559 38463.496 0
[6,] 919.5885 7491.559 1
Specifying the set of tuning parameter (𝜆) values
Recall that our goal is to identify the “best” 𝜆 and corresponding model through cross-validation.
Thus, we need to ensure that we explore a range of 𝜆 values that is wide enough to capture all important levels of model complexity.
We will consider 100 lambda values that span a range from 10−6 to 106 on a logarithmic scale.
To do this, we will use the seq() function together with some
transformations:
> lambdas <- 10^seq(-6, 6, length.out = 100)
Obtaining ridge parameter estimates as a function of 𝜆
We can examine the set of all ridge regression parameter estimates as a function of the 𝜆 values in lambdas.
Specifically, we want to use ridge regression (𝛼 = 0) to estimate the four model parameters corresponding to the intercept and coefficients for balance, income, and student for each of the 100 𝜆 values.
To do this, we use the glmnet() function and set the argument
family to "binomial" to indicate logistic regression, the argument
alpha (𝛼) to zero for ridge regression, and provide the argument lambda (𝜆) our set of tuning parameter values lambdas:
> ridge.fit <- glmnet(X, Y, family = "binomial", alpha = 0,
lambda = lambdas)
Plotting parameter estimates as a function of 𝜆
Recall that glmnet comes with its own useful plotting functions.
Let us view how parameter estimates shrink as 𝜆 increases:
> plot(ridge.fit, xvar = "lambda")
> legend("bottomright", lwd=1, col=1:3, legend = colnames(X))
Using 𝑘-fold CV to fit a ridge logistic regression model
We next wish to perform 𝑘 = 10-fold cross-validation as a function of the set of 𝜆 values in lambdas.
Specifically, we want to use ridge regression (𝛼 = 0) to calculate the 10-fold cross-validation error rate for each of the 100 𝜆 values.
To do this, we use the cv.glmnet() function and set the argument
family to "binomial" to indicate logistic regression, the argument
alpha (𝛼) to zero for ridge regression, provide the argument lambda (𝜆) our set of tuning parameter values lambdas, and set the nfolds
(𝑘) argument equal to 10:
> ridge.cv <- cv.glmnet(X, Y, family = "binomial", alpha = 0,
lambda = lambdas, nfolds = 10)
Plotting the cross-validation error rate as a function of 𝜆
We can view the cross-validation error rate for ridge logistic
regression as a function of 𝜆 with the command:
> plot(ridge.cv)
Numbers at top indicate the number of selected features, which take
values of 0,1,…,𝑝 (ridge regression always retrains all 𝑝 features).
Estimating the best model (simplest with smallest CV error)
We know how to find the model with the smallest CV error rate.
However, sometimes it is better to obtain the 𝜆 of the simplest model that is within one standard error of (i.e., not statistically different from)
the best model.
Given our cross-validated object ridge.cv, we can access this 𝜆 using the command ridge.cv$lambda.1se.
We can then find the best-fit, yet simple, ridge logistic regression
model by typing:
> ridge.1se <- glmnet(X, Y, family = "binomial", alpha = 0,
lambda = ridge.cv$lambda.1se)
Obtaining parameter estimates
We can view the parameter estimates using the coef() function:
> coef(ridge.1se)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -8.128411e+00
balance 3.844218e-03
income 4.968183e-06
student -2.279153e-01
These are similar to our logistic regression estimates:
> glm.mult.fit
Call: glm(formula = default ~ balance + income + student, family = "binomial",
data = Default.recoded)
Coefficients:
(Intercept) balance income student
-1.087e+01 5.737e-03 3.033e-06 -6.468e-01
Degrees of Freedom: 9999 Total (i.e. Null); 9996 Residual
Null Deviance: 2921
Residual Deviance: 1572 AIC: 1580
Predicting probabilities with the fitted model
Given our fitted model ridge.1se, we can now make predictions
from new data with the predict() function in glmnet.
We can predict probabilities of default with:
> predict(ridge.1se, as.matrix(tibble(balance = 1000,
income = 45000, student = 1)), type = "response",
s = ridge.cv$lambda.1se)
1
[1,] 0.01353914
This is about an order of magnitude larger than for logistic regression:
> predict(glm.mult.fit, tibble(balance = 1000,
income = 45000, student = 1), type = "response")
1
0.003530389
Predicting classes with the fitted model
We can also predict classes instead of probabilities by setting the
argument type to "class" instead of "response".
We can therefore find the predicted class as:
> predict(ridge.1se, as.matrix(tibble(balance = 1000,
income = 45000, student = 1)), type = "class",
s = ridge.cv$lambda.1se)
1
[1,] "0"
This is sensible as the predicted default probability associated with
this class < 0.5.
Adding predicted probabilities and classes to the data frame
We may also wish to add predicted class probabilities and classes to
our original data frame:
> estProbs <- predict(ridge.1se, X, type = "response",
s = ridge.cv$lambda.1se)
> estClass <- predict(ridge.1se, X, type = "class",
s = ridge.cv$lambda.1se)
> Default.recoded.withProbs <- Default.recoded %>%
mutate(probs = c(estProbs), pred = c(estClass))
We can peek at this modified data frame using the command:
> head(Default.recoded.withProbs)
default student balance income probs pred
1 0 0 729.5265 44361.625 0.006038429 0
2 0 1 817.1804 12106.135 0.005738740 0
3 0 0 1073.5492 31767.139 0.020966741 0
4 0 0 529.2506 35704.494 0.002687456 0
5 0 0 785.6559 38463.496 0.007267246 0
6 0 1 919.5885 7491.559 0.008293071 0
Obtaining a confusion matrix and accuracy
A confusion matrix for our ridge logistic regression model fit to training
data is obtained by typing:
> Default.recoded.withProbs %>%
select(default, pred) %>%
table()
pred
default 0 1
0 9658 9
1 284 49
Our training accuracy can be obtained by typing:
> Default.recoded.withProbs %>%
summarize(accuracy = mean(pred == default))
accuracy
1 0.9707
Obtaining lasso parameter estimates as a function of 𝜆
The procedure for performing lasso regression in glmnet is akin to
ridge, except we set the argument alpha (𝛼) to one instead of zero:
> lasso.fit <- glmnet(X, Y, family = "binomial", alpha = 1,
lambda = lambdas)
> plot(lasso.fit, xvar = "lambda")
> legend("bottomright", lwd=1, col=1:3, legend = colnames(X))
Using 𝑘-fold CV to fit a lasso logistic regression model
We next wish to perform 𝑘 = 10-fold cross-validation as a function of the set of 𝜆 values in lambdas.
Specifically, we want to use lasso (𝛼 = 1) to calculate the 10-fold cross-validation error rate for each of the 100 𝜆 values.
To do this, we use the cv.glmnet() function and set the argument
family to "binomial" to indicate logistic regression, the argument
alpha (𝛼) to one for lasso regression, provide the argument lambda (𝜆) our set of tuning parameter values lambdas, and set the nfolds
(𝑘) argument equal to 10.
> lasso.cv <- cv.glmnet(X, Y, family = "binomial", alpha = 1,
lambda = lambdas, nfolds = 10)
Plotting the cross-validation error rate as a function of 𝜆
We can view the cross-validation error rate as a function of 𝜆 with the command:
> plot(lasso.cv)
Recall that numbers at the top show the number of selected features,
which take values of 0,1,…,𝑝.
Estimating the best model (simplest with smallest CV error)
Recall that it is sometimes better to obtain the 𝜆 of the simplest model that is within one standard error of (i.e., not statistically different from)
the best model.
Given our cross-validated object lasso.cv, we can access this 𝜆 using the command lasso.cv$lambda.1se.
We can then find this best-fit, yet simple, lasso logistic regression
model by typing:
> lasso.1se <- glmnet(X, Y, family = "binomial", alpha = 1,
lambda = lasso.cv$lambda.1se)
Obtaining parameter estimates
We can view the parameter estimates using the coef() function:
> coef(ridge.1se)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -8.128411e+00
balance 3.844218e-03
income 4.968183e-06
student -2.279153e-01
These are close to our logistic regression estimates:
> glm.mult.fit
Call: glm(formula = default ~ balance + income + student, family = "binomial",
data = Default.recoded)
Coefficients:
(Intercept) balance income student
-1.087e+01 5.737e-03 3.033e-06 -6.468e-01
Degrees of Freedom: 9999 Total (i.e. Null); 9996 Residual
Null Deviance: 2921
Residual Deviance: 1572 AIC: 1580
Predicting class probabilities and predictions
Let us also predict class probabilities and labels, and add these
predictions to our original data frame in one step:
> estProbs <- predict(lasso.1se, X, type = "response",
s = lasso.cv$lambda.1se)
> estClass <- predict(lasso.1se, X, type = "class",
s = lasso.cv$lambda.1se)
> Default.recoded.withProbs <- Default.recoded %>%
mutate(probs = c(estProbs), pred = c(estClass))
We can peek at this modified data frame using the command:
> head(Default.recoded.withProbs)
default student balance income probs pred
1 0 0 729.5265 44361.625 0.004059856 0
2 0 1 817.1804 12106.135 0.005832021 0
3 0 0 1073.5492 31767.139 0.016725900 0
4 0 0 529.2506 35704.494 0.001771389 0
5 0 0 785.6559 38463.496 0.005120073 0
6 0 1 919.5885 7491.559 0.008895485 0
Obtaining a confusion matrix and accuracy
A confusion matrix for our lasso logistic regression model fit to
training data is obtained by typing:
> Default.recoded.withProbs %>%
select(default, pred) %>%
table()
pred
default 0 1
0 9651 16
1 273 60
Our training accuracy can be obtained by typing:
> Default.recoded.withProbs %>%
summarize(accuracy = mean(pred == default))
accuracy
1 0.9711
Logistic regression with 𝐾 > 2 classes
Recall that logistic regression models the probability that 𝑌 is a particular class 𝑗 given features 𝑋 = (𝑋1,𝑋2,…,𝑋𝑝) with the logistic
function:
𝑃 𝑌 = 𝑗|𝑋 = 𝑒𝛽0+𝛽1𝑋1+𝛽2𝑋2+⋯+𝛽𝑝𝑋𝑝
1 + 𝑒𝛽0+𝛽1𝑋1+𝛽2𝑋2+⋯+𝛽𝑝𝑋𝑝
Thus far, we have only considered 𝐾 = 2 classes.
Logistic regression with 𝐾 > 2 classes is multinomial regression.
We can implement a penalized version of multinomial regression using the glmnet() function, except we choose the family to be
"multinomial" instead of "binomial".
Review: the mpg data frame
234 observations on 11 features for 38 models of cars:
manufacturer Car manufacturer
model Model name
displ Engine displacement (in liters)
year Year of manufacture
cyl Number of cylinders
trans Type of transmission
drv f=front-wheel, r=rear wheel, 4=four wheel drive
cty City miles per gallon
hwy Highway miles per gallon
fl Fuel type
class Type of car
We will perform penalized logistic (multinomial) regression to predict drv (𝐾 = 3) from 𝑝 = 3 quantitative features (displ, cty, and hwy).
R example of multinomial regression classifier
Let’s first convert drv values of f, r, and 4 to classes 1, 2, and 3:
> mpg.recoded <- mpg %>%
mutate(drv = ifelse(drv == "f", 1,
ifelse(drv == "r", 2, 3)))
> mpg.recoded
# A tibble: 234 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <dbl> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) 1 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) 1 21 29 p compact
3 audi a4 2 2008 4 manual(m6) 1 20 31 p compact
4 audi a4 2 2008 4 auto(av) 1 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) 1 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) 1 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) 1 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 3 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 3 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 3 20 28 p compact
# ... with 224 more rows
Creating the response and input feature matrices
Recall that glmnet requires that response and feature datasets be
provided separately and as matrices instead of data frames.
We will again use the as.matrix() function to convert data frames
to matrices.
Using this function, we obtain the response (𝑌) and feature (𝑋) matrices as:
> Y <- mpg.recoded %>% select(drv) %>% as.matrix()
> X <- mpg.recoded %>% select(displ, cty, hwy) %>% as.matrix()
The response and input feature matrices
We next use the head() function to peek at the first few rows of the
response and feature matrices:
> head(Y)
drv
[1,] 1
[2,] 1
[3,] 1
[4,] 1
[5,] 1
[6,] 1
> head(X)
displ cty hwy
[1,] 1.8 18 29
[2,] 1.8 21 29
[3,] 2.0 20 31
[4,] 2.0 21 30
[5,] 2.8 16 26
[6,] 2.8 18 26
Using penalized multinomial regression to predict drv
We will use penalized multinomial regression to develop a classifier to
predict whether a car uses front-wheel, rear-wheel, or four-wheel
drive based on its engine size and city and highway mileage.
We will fit this penalized multinomial model with a lasso penalty using
the command:
> lasso.fit <- glmnet(X, Y, family = "multinomial",
alpha = 1, lambda = lambdas)
If we wanted to perform ridge regression, then we would have instead set alpha = 0.
Obtaining lasso parameter estimates as a function of 𝜆
> plot(lasso.fit, xvar = "lambda")
> legend("bottomright", lwd=1, col=1:3, legend = colnames(X))
Obtaining lasso parameter estimates as a function of 𝜆
> plot(lasso.fit, xvar = "lambda")
> legend("bottomright", lwd=1, col=1:3, legend = colnames(X))
Using 𝑘-fold CV to fit a lasso logistic regression model
We next wish to perform 𝑘 = 10-fold cross-validation as a function of the set of 𝜆 values in lambdas.
Specifically, we want to use lasso (𝛼 = 1) to calculate the 10-fold cross-validation error rate for each of the 100 𝜆 values.
To do this, we use the cv.glmnet() function and set the argument
family to "multinomial" to indicate multinomial regression, the
argument alpha (𝛼) to one for lasso regression, provide the argument lambda (𝜆) our set of tuning parameter values lambdas, and set the nfolds (𝑘) argument equal to 10:
> lasso.cv <- cv.glmnet(X, Y, family = "multinomial",
alpha = 1, lambda = lambdas,
nfolds = 10)
Plotting the cross-validation error rate as a function of 𝜆
We can view the cross-validation error rate as a function of 𝜆 with the command:
> plot(lasso.cv)
Recall that numbers at the top show the number of selected features,
which take values of 0,1,…,𝑝.
Estimating the best model (simplest with smallest CV error)
Recall that it is sometimes it is better to obtain the 𝜆 of the simplest model that is within one standard error of (i.e., not statistically
different from) the best model.
Given our cross-validated object lasso.cv, we can access this 𝜆 using the command lasso.cv$lambda.1se.
We can then find this best-fit, yet simple, lasso multinomial regression
model by typing:
> lasso.1se <- glmnet(X, Y, family = "multinomial",
alpha = 1,
lambda = lasso.cv$lambda.1se)
Obtaining parameter estimates
We can view the parameter estimates using the coef() function:
> coef(lasso.1se) $`1`
4 x 1 sparse Matrix of class "dgCMatrix"
s0
-2.2069847
displ .
cty .
hwy 0.1255963
$`2`
4 x 1 sparse Matrix of class "dgCMatrix"
s0
-4.463812
displ 1.081043
cty .
hwy .
$`3`
4 x 1 sparse Matrix of class "dgCMatrix"
s0
6.670797
displ .
cty .
hwy -0.249793
Estimating class probabilities
We may wish to know the probabilities of each of the 𝐾 = 3 classes, which we can estimate as:
> estProbs <- predict(lasso.1se, X, type = "response",
s = lasso.cv$lambda.1se)
We can peek at this set of estimated probabilities with the command:
> as_tibble(estProbs)
[1] # A tibble: 234 x 3
`1.1` `2.1` `3.1`
<dbl> <dbl> <dbl>
1 0.867 0.0166 0.116
2 0.867 0.0166 0.116
3 0.924 0.0171 0.0585
4 0.898 0.0189 0.0828
5 0.668 0.0551 0.277
6 0.668 0.0551 0.277
7 0.722 0.0726 0.205
8 0.694 0.0194 0.287
9 0.612 0.0194 0.369
10 0.818 0.0221 0.160
# ... with 224 more rows
Predicting classes and adding them to our data frame
Next, we predict classes and add them to our original data frame:
> estClass <- predict(lasso.1se, X, type = "class",
s = lasso.cv$lambda.1se)
> mpg.recoded.withPreds <- mpg.recoded %>%
mutate(pred = c(estClass))
We can peek at this modified data frame with the command:
> mpg.recoded.withPreds
# A tibble: 234 x 12
manufacturer model displ year cyl trans drv cty hwy fl class pred
<chr> <chr> <dbl> <int> <int> <chr> <dbl> <int> <int> <chr> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) 1 18 29 p compact 1
2 audi a4 1.8 1999 4 manual(m5) 1 21 29 p compact 1
3 audi a4 2 2008 4 manual(m6) 1 20 31 p compact 1
4 audi a4 2 2008 4 auto(av) 1 21 30 p compact 1
5 audi a4 2.8 1999 6 auto(l5) 1 16 26 p compact 1
6 audi a4 2.8 1999 6 manual(m5) 1 18 26 p compact 1
7 audi a4 3.1 2008 6 auto(av) 1 18 27 p compact 1
8 audi a4 quattro 1.8 1999 4 manual(m5) 3 18 26 p compact 1
9 audi a4 quattro 1.8 1999 4 auto(l5) 3 16 25 p compact 1
10 audi a4 quattro 2 2008 4 manual(m6) 3 20 28 p compact 1
# ... with 224 more rows
Obtaining a confusion matrix and accuracy
A confusion matrix for our lasso multinomial regression model fit to
training data is obtained by typing:
> mpg.recoded.withPreds %>%
select(drv, pred) %>%
table()
pred
drv 1 2 3
1 97 1 8
2 4 5 16
3 23 1 79
Our training accuracy can be obtained by typing:
> mpg.recoded.withPreds %>%
summarize(accuracy = mean(pred == drv))
# A tibble: 1 x 1
accuracy
<dbl>
1 0.774
Summary of this lecture
We assess test accuracy of models with cross-validation, which involves
randomly and repeatedly splitting the data into training and test sets,
and then computing mean performance across iterations.
There are two major types of cross-validation: 𝑘-fold and LOOCV.
In feature selection, we try to minimize test error while also minimizing
the number of features used to make predictions.
We discussed three important algorithms for “traditional” feature
selection: best subset, forward, and background selection.
We also discussed three regularization approaches that “shrink”
parameter estimates toward zero: ridge, lasso, and elastic net.