R statistics Programming

profilecasey7677
Gradient_Descen.pdf

Homework 1 John Ortiz

February 22, 2019

Class: ALY6020 - Predictive Analytics **Professor:** Marco Montes de Oca

Virtually all regression, classification or clustering problems in machine learning require solving an optimization problem. In all of these problems, the objective can be understood as minimizing a loss function while complying to a specific set of constrains. For this reason, it is necessary to understand how some of these optimization problems are solved. One of the simplest and more commonly used algorithms to optimize functions is the Gradient Descent. The Gradient Descent works by calculating (or estimating) the partial derivatives of the function with respect to the parameters used for the optimization. These partial derivatives (also known as the gradient) are then used to update the parameters until a local minimum is reached. This algorithm exploits the fact that the gradient conveys information on the rate of change of the function when taking an infinitely small step in the positive direction (Bushaev, 2015). This means that by updating the parameters against this direction of change, the function will decrease. Keep in mind that it is not feasible to take infinitely small steps. Instead, the step size is controlled via an additional parameter called the learning rate. Mathematically this is expressed as:

minθ1,θ2,...,θp l(x,θ) θ1,θ2, ...,θp parameters

Gradient descent method for an individual parameter:

θi+1 = θi −α ∂l

∂θi Where α is the learning rate: Overall:

Θ = [ θ1 θ2 ... θp

] ∇l =

[ ∂l ∂θi

∂l ∂θ2

... ∂l ∂θp

] Θi+1 = Θi −α∇l

In the following work we used the gradient descent method to minimize the sum of errors for two different problem sets. Additionally we explored the effect of the number of iterations and learning rate on the accuracy and performance of the method. The initial problem set and code for this implementation were obtained from https://docs.google.com/document/d/1Hcc04z990QrU0UsyEenv2pPY262MitoP0N8oPscdU34/ edit?usp=sharing.

Problem 1

min

p∑ i=1

[θi − 1]2

#This is the function we wish to minimize (can be changed) function.to.optimize <- function(x,data = NULL) {

#Optimal point = (1,1,...,1) optimal_point <- rep(1,length(x))

return(sum((x - optimal_point)^2)) }

1

Analyzing the objective function, it is clear that the minimum value is attained when all parameters are equal to 1. Therefore the expected result of applying the gradient descent method is that all parameters should be close to one.

There are several ways to compute the gradient of a function. In this particular case it was decided to approximate the partial derivatives of the function to the first order finite differences for each parameter. This is because of the simplicity of this approximation and because the objective function is well-behaved and continuous on the domain R. Mathematically this approximation is expressed as follows:

∂l

∂θi ≈

δl

δθi ≈ l(θi + h,θ2, ...,θp) − l(θi,θ2, ...,θp)

h

for a sufficiently small h, Overall the gradient is estimated as:

∇l ≈ [ δl δθi

δl δθ2

... δl δθp

] #Computes the gradient of a function fun, at the point x get.gradient <- function(fun, x, data = NULL) {

h <- 0.01 # Derivative resolution

gradient <- rep(0, length(x)) # Initialization of gradient vector

for (i in 1:length(x)) { step.in.i.direction <- rep(0, length(x)) #Initialization of step vector step.in.i.direction[i] <- h # Step in ith direction gradient[i] <- (fun(x + step.in.i.direction,data) - fun(x,data))/h #derivative in ith direction

}

normalized_gradient <- gradient/sqrt(sum(gradient^2)) #Makes gradient a unit vector

return(normalized_gradient) }

The value of h was chosen to provide fast calculation of the finite difference. Following is the implementation of the gradient descent method. grad.descent <- function(fun, init, max_iter, alpha, tol = 1e-6, printing = FALSE,

datapoints = NULL) {

x0 <- init #Initial value of x0 eps <- 1 #initial error i <- 1 # First iteration x <- matrix(NA,max_iter,length(init)) #Preallocating x variable value <- rep(NA,max_iter) #Preallocation function value

while (eps > tol){

nabla <- get.gradient(fun,x0,data = datapoints) value0 <- fun(x0, datapoints) x[i,] <- x0 - nabla * alpha #Gradient descent value[i] <- fun(x[i,],datapoints) #Value of the function at new x

eps <- abs(value[i] - value0) #Error in absolute value

2

if (printing == TRUE){ #In case user needs to print the results of each iteration cat("iteration:",i,

"Current Value: ", format(x0, digits = 4, nsmall = 3), " value: ",format(value0, digits = 2,nsmall = 2), " delta: ",format(eps, digits = 4,nsmall = 5),"\n",sep=" ")

}

x0 <- x[i,] i <- i + 1 #iteration counter

if (i > max_iter){ #If the number of iteration is exceeded, break break

}

}

return(data.frame(na.omit(x),Value = na.omit(value))) #return values as data.frame. The NA.omit ensures that if the method converged before reaching the #max number of iterations. the result data frame will not contain missing (NA) values.

}

The function has 4 main arguments: 1. fun: the objective function to minimize. 2. init: the initial values of the parameters. 3. max_iter: an additional parameter to limit the number of iterations. 4. alpha: learning rate.

Additionally two optional parameters are used, tol and printing. The tol parameter allows the user to change the tolerance, which is the difference between the cost function l at different iterations and printing which controls if the results of each iteration should be shown. It is worth noting that if the difference of the function between two iterations is below the tolerance, the derivative is close to 0 and hence the method has converged. initial_guess <- c(3, -8) grad.descent.results <- grad.descent(function.to.optimize, initial_guess,

max_iter = 1000, alpha = 0.03) optimal_point <- grad.descent.results[nrow(grad.descent.results),1:2] optimal_value <- grad.descent.results[nrow(grad.descent.results),3] paste(c("Optimal point is: ",format(optimal_point,digits = 4)), collapse=" ")

## [1] "Optimal point is: 0.9897 1.019"

paste("Value at optimal point is: ",format(optimal_value,digits = 3))

## [1] "Value at optimal point is: 0.000454"

As expected both parameters are very close to the value of 1. Additionally the value of the objective function at this point is very close to 0 which is the minimum value possible for a squared function. The output of the grad.descent function allow us to check how the objective function in each iteration as well as the parameter values. plot(grad.descent.results[,3],col = "red", xlab = "Number of Iterations", ylab = "Objective function",

type = "b", pch = 19, cex = 0.2) legend("topright",legend = "alpha = 0.03", col = "red", pch = 19)

3

0 200 400 600 800 1000

0 2

0 4

0 6

0 8

0

Number of Iterations

O b

je ct

iv e

f u

n ct

io n

alpha = 0.03

As observed, the objective function reached the minimal value at around 250 iterations. Afterwards it oscillated around this value.

It is important to study the effect of the learning rate and number of iterations in the accuracy and precision of the gradient descent method. To that purpose we have selected 4 different learning rates at three different number of iterations.

α = [1x10−2, 5x10−3, 1x10−3, 5x10−4]

max.iter = [1000, 5000, 10000]

alpha <- c(1e-2,5e-3,1e-3,5e-4) max.iter <- c(1000,5000,10000) pos.legends <- c("bottomleft","topright","topright")

par(mfrow = c(1,3)) for (i in 1:length(max.iter)){

results <- matrix(NA,max.iter[i],1)

for (a in alpha){

grad.descent.results <- grad.descent(function.to.optimize, initial_guess, max_iter = max.iter[i], alpha = a)

results <- cbind(results,grad.descent.results[,3])

4

}

results <- data.frame(results[,-1])

matplot(1:max.iter[i],results, main = paste("Max iterations = ", max.iter[i]), xlab = "Number of Iterations", ylab = "Objective function", type = "b", pch = 19, cex = 0.2, col = c("red","blue","green","black"))

legend(pos.legends[i],legend = c("alpha = 1e-2","alpha = 5e-3","alpha = 1e-3","alpha = 5e-4"), col = c("red","blue","green","black"), pch = 19, cex = 0.9)

}

0 200 600 1000

0 2

0 4

0 6

0 8

0

Max iterations = 1000

Number of Iterations

O b

je ct

iv e

f u

n ct

io n

alpha = 1e−2 alpha = 5e−3 alpha = 1e−3 alpha = 5e−4

0 1000 3000 5000

0 2

0 4

0 6

0 8

0

Max iterations = 5000

Number of Iterations

O b

je ct

iv e

f u

n ct

io n

alpha = 1e−2 alpha = 5e−3 alpha = 1e−3 alpha = 5e−4

0 2000 6000 10000

0 2

0 4

0 6

0 8

0

Max iterations = 10000

Number of Iterations

O b

je ct

iv e

f u

n ct

io n

alpha = 1e−2 alpha = 5e−3 alpha = 1e−3 alpha = 5e−4

As observed in the figures, the learning rate has a significant effect on the efficacy and accuracy of the gradient descent method. At high learning rates (10−2) a solution is obtained with less than 1000 iterations. The solution is close to zero as expected however it is not as precise when compared with the results obtained at lower learning rates. However the number of iterations needed to reach this accuracy was significantly large. At the lowest learning rate, a solution has not yet been attained. While it is possible that the method reaches a more precise solution at higher number of iterations and lower alpha, it is not always recommendable due to memory and time limitations.

Problem 2 Estimating the mean and standard deviation of a sample data Normally Distributed.

In this problem we will use the log maximum likelihood estimation function applied to the Normal distribution function and formulate the corresponding minimization problem. Let us begin with the probability density function of a Normal distribution:

5

f(x|µ,σ2) = 1

√ 2πσ2

exp( −(x−µ)2

2σ2 )

Assuming that each sample is independent from one another. The probability density function for all the samples can be expressed as:

f(x1,x2, ...,xn|µ,σ2) = ( 1

√ 2πσ2

)nexp( n∑ i=1

−(xi −µ)2

2σ2i )

Assuming homocedasticity (equal variance for all data samples, the expression can be simplified as:

f(x1,x2, ...,xn|µ,σ2) = ( 1

√ 2πσ2

)nexp( − ∑n i=1(xi −µ)

2

2σ2 )

Then we apply logarithm to this function:

log(f(x1,x2, ...,xn|µ,σ2)) = log(( 1

√ 2πσ2

)nexp( − ∑n i=1(xi −µ)

2

2σ2 ))

log(f(x1,x2, ...,xn|µ,σ2)) = − n

2 log(2πσ2) −

∑n i=1(xi −µ)

2

2σ2 The objective is to maximize the log-likelihood function by changing parameters µ and sigma

maxµ,σ − n

2 log(2πσ2) −

∑n i=1(xi −µ)

2

2σ2

log.likelihood.normal <- function(x,data){

n <- length(data[,1]) mu <- x[1] sigma <- x[2]

z <- (data - mu)/sigma

log_lik_norm <- (n / 2) * log(2*pi*sigma^2) + sum(z^2)/2 #signs inverted to transform maximization into minimization problem return(log_lik_norm)

}

Before applying the gradient descent to minimize this function, we need if the sample data is normally distributed by plotting the histogram of the data: data <- read.csv("sample.csv") hist(data[,1], xlab = "Datapoints", main = "Histogram of sample data")

6

Histogram of sample data

Datapoints

F re

q u

e n

cy

−2 0 2 4 6 8

0 5

0 0

0 1

0 0

0 0

1 5

0 0

0

As observed in the histogram, the sample data follows a Normal distribution. If there is doubt about the distribution it is possible to run normality tests. However given the bell curve in the histogram and the large number of datapoints (≈ 100000), we can safely assume that the data follows a Normal distribution.

Additionally based on the information of the histogram we can foree that the mean value should be slighlty above 2. in a normal distribution, around 68% of the data is within one standard deviation. Hence: q1 <- as.numeric(quantile(data[,1],c(0.167,0.833))) est.sd <- (q1[2] - q1[1])/2 print(est.sd)

## [1] 1.256357

The estimation of a standard deviation is 1.256.

finally The gradient descent method to maximize the log-likelihood function. initial_guess <- c(3, 1) grad.descent.results <- grad.descent(log.likelihood.normal, initial_guess,

max_iter = 150, alpha = 1e-2,datapoints = data)

sample.mean <- mean(data[,1]) sample.sd <- sd(data[,1])

optimal_point <- grad.descent.results[nrow(grad.descent.results),1:2] optimal_value <- grad.descent.results[nrow(grad.descent.results),3] paste(c("Optimal point is: ",format(optimal_point,digits = 4)), collapse=" ")

## [1] "Optimal point is: 2.329 1.302"

7

paste(c("sample mean is: ",format(sample.mean,digits = 4)), collapse=" ")

## [1] "sample mean is: 2.334"

paste(c("sample standard deviation is: ",format(sample.sd,nsmall = 4)), collapse=" ")

## [1] "sample standard deviation is: 1.299684"

plot(grad.descent.results[,3],col = "red", xlab = "Number of Iterations", ylab = "Objective function", type = "b", pch = 19, cex = 0.2)

legend("topright",legend = "alpha = 0.01", col = "red", pch = 19)

0 50 100 150

1 7

0 0

0 0

1 8

0 0

0 0

1 9

0 0

0 0

Number of Iterations

O b

je ct

iv e

f u

n ct

io n

alpha = 0.01

The estimates of the population mean and standard deviation were 2.329 and 1.302. These values are close to the expected results based on the histogram. Moreover the value are very close to the sample mean and standard deviation calculated for the overall dataset.

Similarly to Problem 1, we study the effect of the learning rate and number of iterations in the accuracy and precision of the gradient descent method in this problem. Given the complexity of the problem and the large number of dataset, we have selected a restricted number of alpha and number of iterations.

α = [1x10−2, 5x10−3, 1x10−3, 5x10−4]

max.iter = [100, 500, 2000]

alpha <- c(1e-2,5e-3,1e-3,5e-4) max.iter <- c(100,500,2000) initial_guess <- c(3,1) pos.legends <- c("bottomleft","topright","topright")

8

mean.vals <- matrix(NA,length(max.iter),length(alpha)) sd.vals <- matrix(NA,length(max.iter),length(alpha))

colnames(mean.vals) <- alpha colnames(sd.vals) <- alpha

rownames(mean.vals) <- max.iter rownames(sd.vals) <- max.iter

par(mfrow = c(1,3)) for (i in 1:length(max.iter)){

results <- matrix(NA,max.iter[i],1)

for (j in 1:length(alpha)){

grad.descent.results <- grad.descent(log.likelihood.normal, initial_guess, max_iter = max.iter[i], alpha = alpha[j], datapoints = data)

n <- nrow(grad.descent.results) mean.vals[i,j] <- grad.descent.results[n,1] sd.vals[i,j] <- grad.descent.results[n,2] results <- cbind(results,grad.descent.results[,3])

}

results <- data.frame(results[,-1])

matplot(1:max.iter[i],results, main = paste("Max iterations = ", max.iter[i]), xlab = "Number of Iterations", ylab = "Objective function", type = "b", pch = 19, cex = 0.2, col = c("red","blue","green","black"))

legend(pos.legends[i],legend = c("alpha = 1e-2","alpha = 5e-3","alpha = 1e-3","alpha = 5e-4"), col = c("red","blue","green","black"), pch = 19, cex = 0.9)

}

9

0 20 40 60 80 100

1 7

0 0

0 0

1 7

5 0

0 0

1 8

0 0

0 0

1 8

5 0

0 0

1 9

0 0

0 0

1 9

5 0

0 0

Max iterations = 100

Number of Iterations

O b

je ct

iv e

f u

n ct

io n

alpha = 1e−2 alpha = 5e−3 alpha = 1e−3 alpha = 5e−4

0 100 300 500

1 7

0 0

0 0

1 7

5 0

0 0

1 8

0 0

0 0

1 8

5 0

0 0

1 9

0 0

0 0

1 9

5 0

0 0

Max iterations = 500

Number of Iterations

O b

je ct

iv e

f u

n ct

io n

alpha = 1e−2 alpha = 5e−3 alpha = 1e−3 alpha = 5e−4

0 500 1000 2000

1 7

0 0

0 0

1 7

5 0

0 0

1 8

0 0

0 0

1 8

5 0

0 0

1 9

0 0

0 0

1 9

5 0

0 0

Max iterations = 2000

Number of Iterations O

b je

ct iv

e f

u n

ct io

n

alpha = 1e−2 alpha = 5e−3 alpha = 1e−3 alpha = 5e−4

As observed on the figure, at low number of iterations, convergence was achieved only with the highest learning rate. At 500 iterations, only learning rates above 5x10−3 had converged to a solution. The other curves are slowly approaching a solution. Finally at 2000 iterations, all learning rates have reached the same solution It is worth noting that an order of magnitude of difference between the learning rates produces a significant increase in the number of iterations required for convergence. Between 1x10−2 and 1x10−3 there are over 600 extra iterations. Between 5x10−3 and 5x10−4 there are over 1500 extra steps required for convergence. print("Mean matrix rows = Max Iterations, cols = Alpha")

## [1] "Mean matrix rows = Max Iterations, cols = Alpha"

print(mean.vals)

## 0.01 0.005 0.001 5e-04 ## 100 2.328864 2.618702 2.943756 2.973000 ## 500 2.328864 2.328864 2.618067 2.840749 ## 2000 2.328864 2.328864 2.328864 2.328864

print("stardard deviation matrix rows = Max Iterations, cols = Alpha")

## [1] "stardard deviation matrix rows = Max Iterations, cols = Alpha"

print(sd.vals)

## 0.01 0.005 0.001 5e-04 ## 100 1.301767 1.299255 1.082623 1.042077 ## 500 1.301767 1.292963 1.298538 1.191252 ## 2000 1.301767 1.292963 1.294352 1.294501

10

The first table shows the final mean value per number of iterations (rows) and learning rate. It is worth noting that the method reached to the same mean value for all learning rates at 2000 iterations. This could imply that this value is the true population mean of the data, however further studies at lower learning rates might be required. On the other hand, the second table shows the standard deviation at different iterations and learning rates. At low values of alpha, the optimal value is slightly off from the true standard deviation. This value becomes more accurate as alpha increases.

In conclusion the gradient descent method is a powerful method to solve optimization methods with relative ease. The results show that the method is highly dependent on the number of iterations and learning rate, hence cross validation methods and sensitivity analysis are required to fully define the proper value for these parameters. Additionally, the resolution of the derivative could also be a hyper parameter to study in these sensitivty analysis since it affects the accuracy of the estimation of the partial derivatives.

Bibliography * Bushaev, V. (2015, November 17). How do we ‘train’ neural networks. Retrieved from Towards Data Science: https://towardsdatascience.com/how-do-we-train-neural-networks-edd985562b73 *

11