Python Programming

Rico123
Optimization.ipynb

{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Homework 4.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "wpTRNuz3KDWV" }, "source": [ "# Problem Set 4\n", "In this problem set you will get some practice with the proximal gradient algorithm, and also acceleration. Specifically, you will be implementing ISTA and FISTA" ] }, { "cell_type": "markdown", "metadata": { "id": "rSlW0VqxKPdg" }, "source": [ "# Problem 1: Gradient Descent and Acceleration\n", "In this problem you will explore the impact of ill-conditioning on gradient descent, and will then see how acceleration can improve the situation. This exercise will walk you through a very similar situation as to what we saw in the lecture videos that illustrate the performance of gradient descent vs accelerated gradient descent as the condition number (ratio of largest to smallest eigenvalues of the Hessian) increases. This is a ``toy'' problem, but it is still instructive regarding the performance of these two algorithms.\n", "\n", "You will work with the following simple function:\n", "$$\n", "f(x) = \\frac{1}{2}x^{\\top}Qx,\n", "$$\n", "where $Q$ is a 2 by 2 matrix, as defined below." ] }, { "cell_type": "code", "metadata": { "id": "IIUDm7AuKcdm" }, "source": [ "# We create the data for this simple problem. We will create three quadratics.\n", "# Q_wc -- this is a well-conditioned matrix\n", "# Q_ic -- this is an ill-conditioned matrix\n", "# Q_sic -- this is... a somewhat-ill-conditioned matrix (a technical term!)\n", "import numpy as np\n", "Q_wc = np.array([[1,0.3],[0.3,1]]); q = np.array([0,0]);\n", "Q_sic = np.array([[1,0.85],[0.85,1]]); q = np.array([0,0]);\n", "Q_ic = np.array([[1,0.99],[0.99,1]]); q = np.array([0,0]);\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "cygDOkGVKdcH" }, "source": [ "## Part (A): \n", "Consider the quadratic functions $f_{wc}$, $f_{sic}$, and $f_{ic}$ defined by the quadratic matrices above. For each of these, say whether they are $\\beta$-smooth and/or $\\alpha$-strongly convex, and if so, compute the value of the condition number, $\\kappa = \\beta/\\alpha$ for each function." ] }, { "cell_type": "markdown", "metadata": { "id": "La1aIXOoKiVa" }, "source": [ "## Part (B): \n", "Compute the best fixed step size for gradient descent, and the best parameters for accelerated gradient descent. For each function, plot the error $(f(x_t) - f(x^{\\ast})$ as a function of the number of iterations. For each function, plot these on the same plot so you can compare -- so you should have 3 plots total." ] }, { "cell_type": "markdown", "metadata": { "id": "-YSu7lbrKlmJ" }, "source": [ "# Problem 2: ISTA and FISTA\n", "Recall the least squares problem with $\\ell^1$ regularization from the previous homework:\n", "$$\n", "\\min_x \\left[f(x) = \\frac{1}{2}\\|{Ax-b}\\|_2^2 + \\lambda \\|{x}\\|_1 \\right]\n", "$$\n", "\n", "Recall key characteristics of this problem: it is nonsmooth due to the regularization term, and it is not strongly convex when $A$ has more columns than rows. This is why you used the sub-gradient method on the previous problem set, rather than Gradient descent. \n", "\n", "Recall the goal of the proximal gradient algorithm: when we have a composite function, i.e., a function of the form $f(x) = g(x) + h(x)$, if $g(x)$ is $\\beta$-smooth and $h(x)$ is ``simple'' in the sense that it has a simple prox function, then rather than using the subgradient method, we can get much better results by using proximal gradient, which takes advantage of the fact that $g(x)$ is smooth. We can improve this further by combining the proximal gradient method with acceleration. \n", "\n", "Using the same data (same $A$ and $b$) as in Problem Set 3, minimize $f(x)$ using $10^4$ iterations with $t=0$ and $x_0 =0$. \n", "\n", "Use the proximal gradient algorithm, also known as ISTA for the case where $f$ is the LASSO objective. Now use the accelerated proximal gradient algorithm, also, known as FISTA. Plot these results on the same plot as your results for sub-gradient descent from the previous lecture. " ] }, { "cell_type": "code", "metadata": { "id": "c92ZUaqytFPR" }, "source": [ "import numpy as np\n", "import numpy.random as rn\n", "import numpy.linalg as la\n", "import matplotlib.pyplot as plt\n", "import time\n", "\n", "A = np.load(\"A.npy\")\n", "b = np.load(\"b.npy\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "mK5xRMjtKqWj" }, "source": [ "# Problem 3: Optional -- Why we use LASSO\n", "\n", "As an optional exercise, you may want to play around with Lasso and explore its properties in the context of machine learning problems. \n", "\n", "To do this: (A) Generate data for yourself at random: create a $n \\times d$ matrix $A$ where each entry comes from a standard Gaussian. Choose $d$ much larger than $n$, say, $d = 1000$ and $n = 100$. Now choose the true solution, $x^{\\ast}$ to be a $k$-sparse vector. You can do this in many ways. One simple approach is to let $x^{\\ast}$ equal 10 on $k=5$ randomly chosen entries, and then zero every where else. Finally, generate $y$ according to\n", "$$\n", "y = Ax + \\epsilon,\n", "$$\n", "where $\\epsilon$ is zero mean Gaussian noise with variance $0.1$.\n", "\n", "Now solve (via an algorithm of your choice) Lasso. Note that you will have to search for a good value for $\\lambda$. Compare the solution you get $\\hat{x}_{\\rm lasso}$ with the true solution, as you vary $\\lambda$. You may also want to compare it to the solution when you do not have any regularization. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "DCw3ZCBGKtAS" }, "source": [ "# Problem 4: Logistic Regression\n", "\n", "Logistic regression is a simple statistical classification method which models\n", "the conditional distribution of the class variable $y$ being equal to class $c$\n", "given an input $x \\in \\mathbb{R}^n$. We will examine two classification tasks, one\n", "classifying newsgroup posts, and the other classifying digits. In these tasks\n", "the input $x$ is some description of the sample (e.g., word counts in the news\n", "case) and $y$ is the category the sample belongs to (e.g., sports, politics).\n", "The Logistic Regression model assumes the class distribution conditioned on $x$\n", "is log-linear:\n", "$$\n", "p(y=c|x,b_{1:C}) = \\frac{e^{-b_c^\\top x}}{\\sum_{j=1}^C e^{-b_j^\\top x}},\n", "$$\n", "where $C$ is the total number of classes, and the denominator sums over all\n", "classes to ensure that $p(y|x)$ is a proper probability distribution. Each\n", "class $c \\in {1,2, \\dots, C}$ has a parameter $b_c$, and $\\mathbf{b} \\in\n", "\\mathbb{R}^{nC}$ is the vector of concatenated parameters $\\mathbf{b} =\n", "[b_1^\\top,b_2^\\top,\\dots,b_C^\\top]^\\top$. Let $X \\in \\mathbb{R}^{N \\times n}$ be the\n", "data matrix where each sample $x_i^\\top$ is a row and $N$ is the number of\n", "samples. The maximum likelihood approach seeks to find the parameter\n", "$\\mathbf{b}$ which maximizes the likelihood of the classes given the input data\n", "and the model:\n", "\n", "$$\n", "\\max_{b_{1:C}} \\; p(y|x,b_{1:C}) = \\prod_{i=1}^N p(y_i|x_i,b_{1:C}) = \\prod_{i=1}^N \\frac{e^{-b_{y_i}^\\top x_i}}{\\sum_{j=1}^C e^{-b_j^\\top x_i}}.\n", "$$\n", "\n", "For the purposes of optimization, we can equivalently minimize the negative log\n", "likelihood:\n", "$$\n", "\\min_\\mathbf{\\beta} \\ell(\\mathbf{\\beta}) = -\\log p(\\textbf{y}|X, \\mathbf{\\beta}) = \\sum_{i=1}^N \\left( \\beta_{y_i}^\\top x_i + \\log{\\sum_{j=1}^C e^{-\\beta_j^\\top x_i}} \\right).\n", "$$\n", "\n", "After optimization, the model can be used to classify a new input by choosing\n", "the class that the model predicts as having the highest likelihood; note that\n", "we don't have to compute the normalizing quantity $\\sum_{j=1}^C e^{-b_j^\\top x}$\n", "as it is constant across all classes:\n", "$$\n", "y = \\arg\\max_j p(y=j| x, \\mathbf{\\beta}) = \\arg\\min_j \\beta_j^\\top x\n", "$$\n", "In this problem, you will optimize the logistic regression model for the two\n", "classification tasks mentioned above which vary in dimension and number of\n", "classes. The newsgroup dataset that we consider here has $C=20$. \n", "\n", "We will compare the performance of gradient descent and Nesterov's accelerated gradient\n", "method on the $\\ell^2$-regularized version of the logistic regression model:\n", "$$\n", "\\min_{\\boldsymbol{\\beta}} = \\frac{1}{N} \\sum_{i=1}^N \\left( \\beta_{y_i}^\\top x_i + \\log{\\sum_{j=1}^C e^{-\\beta_j^\\top x_i}} \\right) + \\mu \\|\\boldsymbol{\\beta}\\|^2.\n", "$$\n", "\n", "In this homework, we will use the training and testing data contained in the four csv files in logistic\\_news.zip. In a later homework, we will look into the digits dataset (MNIST).\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "fqKRqJrtshSk" }, "source": [ "#sample code to load in logistic_news.zip\n", "#we also create a Z matrix, useful for multiclass logistic regression optimization\n", "import zipfile as zipfile\n", "import csv\n", "\n", "class Data:\n", " def __init__(self,X,Y):\n", " self.X=X\n", " self.Y=Y \n", "\n", "def loaddata(filename):\n", " import io\n", " data=[]\n", " with zipfile.ZipFile(filename) as z:\n", " for filename in z.namelist():\n", " if filename[0]=='X'or filename[0]=='y':\n", " each=[]\n", " with z.open(filename) as csvDataFile:\n", " csvReader=csv.reader(csvDataFile)\n", " csvDataFile_utf8 = io.TextIOWrapper(csvDataFile, 'utf-8')\n", " csvReader=csv.reader(csvDataFile_utf8)\n", " for row in csvReader:\n", " each.append(row)\n", " each==[[float(string) for string in row] for row in each]\n", " each=np.asarray(each)\n", " data.append(each) \n", " X_te=data[0].astype(float)\n", " X_tr=data[1].astype(float)\n", " y_te=data[2][0].astype(int)\n", " y_tr=data[3][0].astype(int)\n", " Z_tr,Z_te=[],[]\n", " for j in range(len(np.unique(y_tr))):\n", " Z_tr.append(np.sum(X_tr[np.where(y_tr==j)[0],:],axis=0))\n", " Z_te.append(np.sum(X_te[np.where(y_te==j)[0],:],axis=0))\n", " Z_tr=np.asarray(Z_tr).T\n", " Z_te=np.asarray(Z_te).T\n", " train= Data(X_tr,y_tr)\n", " train.Z=Z_tr\n", " test = Data(X_te,y_te)\n", " test.Z=Z_te\n", " return train,test\n", "\n", "train,test=loaddata('./logistic_news.zip')" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "sO78y9YxKvui" }, "source": [ "## Part (A) -- Optional -- \n", "Find the value of $\\mu$ that gives you (approximately) the best generalization performance (error on test set). You obtain this by solving the the above optimization problem for different values of $\\mu$, and then checking the performance of the solution on the testing set, using the unregularized logistic regression loss. Note that this is not a question about an optimization method.\n", "\n", "What value do you get for the test loss after convergence? " ] }, { "cell_type": "markdown", "metadata": { "id": "VxkrZ-d2rTl3" }, "source": [ "## Part (B) \n", "\n", "If you did Part (A), use the value of $\\mu$ you found there. If you did not, use $\\mu = 0.001$. \n", "\n", "Plot the loss against iterations for both the test and training data\n", "using the value of $\\mu$ from part (a)." ] }, { "cell_type": "markdown", "metadata": { "id": "0SidZhForTAT" }, "source": [ "## Part (C) \n", "\n", "How do the two algorithms differ in performance, and how does this change\n", "as you decrease $\\mu$? " ] } ] }