Python Programming
Problem Set 4 In this problem set you will get some practice with the proximal gradient algorithm, and also acceleration. Specifically, you will be implementing ISTA and FISTA
Problem 1: Gradient Descent and Acceleration In this problem you will explore the impact of ill-conditioning on gradient descent, and will then see how acceleration can improve the situation. This exercise will walk you through a very similar situation as to what we saw in the lecture videos that illustrate the performance of gradient descent vs accelerated gradient descent as the condition number (ratio of largest to smallest eigenvalues of the Hessian) increases. This is a ``toy'' problem, but it is still instructive regarding the performance of these two algorithms.
You will work with the following simple function:
where is a 2 by 2 matrix, as defined below.
𝑓(𝑥) = 𝑄𝑥, 1
2 𝑥 ⊤
𝑄
In [ ]:
Part (A): Consider the quadratic functions , , and defined by the quadratic matrices above. For each of these, say whether they are -smooth and/or -strongly convex, and if so, compute the value of the condition number, for each function.
𝑓𝑤𝑐 𝑓𝑠𝑖𝑐 𝑓𝑖𝑐 𝛽 𝛼
𝜅= 𝛽/𝛼
Part (B): Compute the best fixed step size for gradient descent, and the best parameters for accelerated gradient descent. For each function, plot the error as a function of the number of iterations. For each function, plot these on the same plot so you can compare -- so you should have 3 plots total.
(𝑓( )−𝑓( )𝑥𝑡 𝑥 ∗
Problem 2: ISTA and FISTA Recall the least squares problem with regularization from the previous homework:ℓ1
[𝑓(𝑥) = ‖𝐴𝑥− 𝑏 +𝜆‖𝑥 ]min𝑥 1
2 ‖2 2
‖1
# We create the data for this simple problem. We will create three quadratics. # Q_wc -- this is a well-conditioned matrix # Q_ic -- this is an ill-conditioned matrix # Q_sic -- this is... a somewhat-ill-conditioned matrix (a technical term!) import numpy as np Q_wc = np.array([[1,0.3],[0.3,1]]); q = np.array([0,0]); Q_sic = np.array([[1,0.85],[0.85,1]]); q = np.array([0,0]); Q_ic = np.array([[1,0.99],[0.99,1]]); q = np.array([0,0]);
Recall key characteristics of this problem: it is nonsmooth due to the regularization term, and it is not strongly convex when has more columns than rows. This is why you used the sub-gradient method on the previous problem set, rather than Gradient descent.
Recall the goal of the proximal gradient algorithm: when we have a composite function, i.e., a function of the form , if is -smooth and is ``simple'' in the sense that it has a simple prox function, then rather than using the subgradient method, we can get much better results by using proximal gradient, which takes advantage of the fact that is smooth. We can improve this further by combining the proximal gradient method with acceleration.
Using the same data (same and ) as in Problem Set 3, minimize using iterations with and .
Use the proximal gradient algorithm, also known as ISTA for the case where is the LASSO objective. Now use the accelerated proximal gradient algorithm, also, known as FISTA. Plot these results on the same plot as your results for sub-gradient descent from the previous lecture.
𝐴
𝑓(𝑥) = 𝑔(𝑥)+ℎ(𝑥) 𝑔(𝑥) 𝛽 ℎ(𝑥)
𝑔(𝑥)
𝐴 𝑏 𝑓(𝑥) 10 4
𝑡= 0
= 0𝑥0
𝑓
In [ ]:
Problem 3: Optional -- Why we use LASSO As an optional exercise, you may want to play around with Lasso and explore its properties in the context of machine learning problems.
To do this: (A) Generate data for yourself at random: create a matrix where each entry comes from a standard Gaussian. Choose much larger than , say, and . Now choose the true solution, to be a -sparse vector. You can do this in many ways. One simple approach is to let equal 10 on randomly chosen entries, and then zero every where else. Finally, generate according to
where is zero mean Gaussian noise with variance .
Now solve (via an algorithm of your choice) Lasso. Note that you will have to search for a good value for . Compare the solution you get with the true solution, as you vary . You may also want to compare it to the solution when you do not have any regularization.
𝑛×𝑑 𝐴
𝑑 𝑛 𝑑= 1000 𝑛= 100
𝑥∗ 𝑘 𝑥∗
𝑘= 5 𝑦
𝑦=𝐴𝑥+ 𝜖,
𝜖 0.1
𝜆
�̂� lasso 𝜆
Problem 4: Logistic Regression Logistic regression is a simple statistical classification method which models the conditional distribution of the class variable being equal to class given an input . We will examine two classification tasks, one classifying newsgroup posts, and the other classifying digits. In these tasks the input is some description of the sample (e.g., word counts in the news case) and is the category the sample belongs to (e.g., sports, politics). The Logistic Regression model assumes the class distribution conditioned on is log-linear:
𝑦 𝑐 𝑥∈ ℝ 𝑛
𝑥
𝑦
𝑥
𝑝(𝑦= 𝑐|𝑥, ) = ,𝑏1:𝐶 𝑒− 𝑥𝑏 ⊤ 𝑐
∑ 𝐶
𝑗=1 𝑒− 𝑥𝑏 ⊤ 𝑗
import numpy as np import numpy.random as rn import numpy.linalg as la import matplotlib.pyplot as plt import time A = np.load("A.npy") b = np.load("b.npy")
where is the total number of classes, and the denominator sums over all classes to ensure that is a proper probability distribution. Each class has a parameter , and is the vector of concatenated parameters . Let be the data matrix where each sample is a row and is the number of samples. The maximum likelihood approach seeks to find the parameter which maximizes the likelihood of the classes given the input data and the model:
For the purposes of optimization, we can equivalently minimize the negative log likelihood:
After optimization, the model can be used to classify a new input by choosing the class that the model predicts as having the highest likelihood; note that we don't have to compute the normalizing quantity
as it is constant across all classes:
In this problem, you will optimize the logistic regression model for the two classification tasks mentioned above which vary in dimension and number of classes. The newsgroup dataset that we consider here has
.
We will compare the performance of gradient descent and Nesterov's accelerated gradient method on the -regularized version of the logistic regression model:
In this homework, we will use the training and testing data contained in the four csv files in logistic_news.zip. In a later homework, we will look into the digits dataset (MNIST).
𝐶 𝑝(𝑦|𝑥)
𝑐∈ 1,2,…,𝐶 𝑏𝑐 𝐛∈ ℝ 𝑛𝐶
𝐛= [ , ,…,𝑏⊤ 1 𝑏⊤ 2
𝑏⊤ 𝐶 ]⊤ 𝑋∈ ℝ
𝑁×𝑛
𝑥⊤ 𝑖
𝑁
𝐛
𝑝(𝑦|𝑥, ) = 𝑝( | , ) = .max 𝑏1:𝐶
𝑏1:𝐶 ∏ 𝑖=1
𝑁
𝑦𝑖 𝑥𝑖 𝑏1:𝐶 ∏ 𝑖=1
𝑁 𝑒 −𝑏 ⊤ 𝑦𝑖 𝑥𝑖
∑𝐶 𝑗=1 𝑒−𝑏 ⊤ 𝑗𝑥𝑖
ℓ(𝛽) =− log𝑝(𝐲|𝑋,𝛽) = (
+log ) .min
𝛽 ∑ 𝑖=1
𝑁
𝛽⊤𝑦𝑖𝑥𝑖 ∑ 𝑗=1
𝐶
𝑒 −𝛽 ⊤ 𝑗 𝑥𝑖
∑𝐶 𝑗=1 𝑒− 𝑥𝑏 ⊤ 𝑗
𝑦= arg 𝑝(𝑦= 𝑗|𝑥,𝛽) = arg 𝑥max 𝑗
min 𝑗 𝛽⊤ 𝑗
𝐶 = 20
ℓ 2
= (
+log ) +𝜇‖𝜷 .min
𝜷
1
𝑁 ∑ 𝑖=1
𝑁
𝛽⊤𝑦𝑖𝑥𝑖 ∑ 𝑗=1
𝐶
𝑒 −𝛽 ⊤ 𝑗 𝑥𝑖 ‖
2
In [ ]:
Part (A) -- Optional -- Find the value of that gives you (approximately) the best generalization performance (error on test set). You obtain this by solving the the above optimization problem for different values of , and then checking the performance of the solution on the testing set, using the unregularized logistic regression loss. Note that this is not a question about an optimization method.
What value do you get for the test loss after convergence?
𝜇
𝜇
Part (B) If you did Part (A), use the value of you found there. If you did not, use .
Plot the loss against iterations for both the test and training data using the value of from part (a).
𝜇 𝜇= 0.001
𝜇
#sample code to load in logistic_news.zip #we also create a Z matrix, useful for multiclass logistic regression optimization import zipfile as zipfile import csv class Data: def __init__(self,X,Y): self.X=X self.Y=Y def loaddata(filename): import io data=[] with zipfile.ZipFile(filename) as z: for filename in z.namelist(): if filename[0]=='X'or filename[0]=='y': each=[] with z.open(filename) as csvDataFile: csvReader=csv.reader(csvDataFile) csvDataFile_utf8 = io.TextIOWrapper(csvDataFile, 'utf-8') csvReader=csv.reader(csvDataFile_utf8) for row in csvReader: each.append(row) each==[[float(string) for string in row] for row in each] each=np.asarray(each) data.append(each) X_te=data[0].astype(float) X_tr=data[1].astype(float) y_te=data[2][0].astype(int) y_tr=data[3][0].astype(int) Z_tr,Z_te=[],[] for j in range(len(np.unique(y_tr))): Z_tr.append(np.sum(X_tr[np.where(y_tr==j)[0],:],axis=0)) Z_te.append(np.sum(X_te[np.where(y_te==j)[0],:],axis=0)) Z_tr=np.asarray(Z_tr).T Z_te=np.asarray(Z_te).T train= Data(X_tr,y_tr) train.Z=Z_tr test = Data(X_te,y_te) test.Z=Z_te return train,test train,test=loaddata('./logistic_news.zip')
Part (C) How do the two algorithms differ in performance, and how does this change as you decrease ?𝜇