This is a homework assignment for the "Data Vision" class.
Homework 3: Backpropagation in Neural Networks
What do you need to do? Your goal is to train two networks using backpropagation. The first network solves the XOR problem with a 2-layer architecture. The second network learns to multiply two integers.
What happens in a Forward Pass?
What happens in a Backward Pass?
Why do we need this activation function?
- For the XOR problem, the training and test datasets both consist of only 4 points.
- For the Multiplication problem, the training dataset consists of pairs of numbers (x1, x2) with ground-truth value y = x1 * x2.
We restrict the inputs to (-100, 100) and generate a training dataset of 1000 points, a validation set of 200 points and a test dataset of 5000 points within the same domain.
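The datasets described above can be generated along these lines. This is a minimal sketch, not the assignment's reference code: the helper names `make_xor` and `make_multiplication` are hypothetical, and it assumes that (-100, 100) means integers in the open interval, so -99 through 99.

```python
import numpy as np

def make_xor():
    """The four XOR points, used for both training and testing."""
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0], dtype=float)
    return X, y

def make_multiplication(n_points, rng, low=-100, high=100):
    """Random integer pairs (x1, x2) with target y = x1 * x2.

    Assumption: the bounds are exclusive, so values range over low+1 .. high-1.
    """
    X = rng.integers(low + 1, high, size=(n_points, 2)).astype(float)
    y = X[:, 0] * X[:, 1]
    return X, y

rng = np.random.default_rng(0)
X_xor, y_xor = make_xor()
X_train, y_train = make_multiplication(1000, rng)
X_val, y_val = make_multiplication(200, rng)
X_test, y_test = make_multiplication(5000, rng)
```

Drawing the three splits from the same generator keeps them in the same domain while making them independent samples.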
- Once your datasets are ready, what comes next?
- Both networks have 1 input layer, 1 hidden layer and 1 output layer.
- For the XOR problem, there are 2 input neurons, 2 (variable) hidden neurons followed by an activation function and 1 output neuron followed by a Sigmoid function, which squashes the output to a value between 0 and 1.
- For the Multiplication problem, there are 2 input neurons, N (variable) hidden neurons followed by an activation function and 1 output neuron. The output neuron should not be followed by any activation function.
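The two architectures above differ only in the hidden width and in whether the output is squashed, so a single forward function can cover both. The sketch below rests on assumptions the text does not make: tanh as the hidden activation, bias terms on both layers, and the hypothetical names `init_params` and `forward`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_hidden, rng):
    """Small random weights for a 2 -> n_hidden -> 1 network (biases are an assumption)."""
    return {
        "W1": rng.normal(0.0, 0.5, size=(2, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(X, params, squash_output):
    """Forward pass: hidden layer with tanh (assumed), then a single output neuron."""
    hidden = np.tanh(X @ params["W1"] + params["b1"])
    score = hidden @ params["W2"] + params["b2"]
    # XOR: sigmoid on the output. Multiplication: raw score, no activation.
    return sigmoid(score) if squash_output else score

rng = np.random.default_rng(0)
xor_out = forward(np.array([[0.0, 1.0]]), init_params(2, rng), squash_output=True)
mul_out = forward(np.array([[3.0, -7.0]]), init_params(8, rng), squash_output=False)
```

Leaving the multiplication output unsquashed matters because its targets span a range far outside (0, 1).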
What’s Next?
- Define a loss function that quantifies our unhappiness with the scores across the training data.
- Come up with a way of efficiently finding the parameters that minimize the loss function (optimization).
Loss Function
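Assuming the least-squares cost from the course notes cited in the references, with hypothesis $h_\theta$ and $m$ training examples $(x^{(i)}, y^{(i)})$, the loss can be written as:

```latex
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```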
If you’ve seen linear regression before, you may recognize this as the familiar least-squares cost function that gives rise to the ordinary least squares regression model.
We want to choose θ so as to minimize J(θ). To do so, let’s use a search algorithm that starts with some “initial guess” for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ). Specifically, let’s consider the gradient descent algorithm, which starts with some initial θ, and repeatedly performs the update:
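With learning rate $\alpha$, the update is usually written as follows. This is a reconstruction that assumes the standard form from the cited course notes:

```latex
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
```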
Gradient Descent
(This update is simultaneously performed for all values of j = 0,...,n.) Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J.
Let’s first work it out for the case where we have only one training example (x,y), so that we can neglect the sum in the definition of J. We have:
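Assuming the linear hypothesis $h_\theta(x) = \sum_{k=0}^{n} \theta_k x_k$ used in the cited course notes, the single-example derivative works out as:

```latex
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_\theta(x) - y \right)^2 \\
  &= \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{n} \theta_k x_k - y \right) \\
  &= \left( h_\theta(x) - y \right) x_j
\end{aligned}
```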
Gradient Descent
For a single training example, this gives the update rule:
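Substituting the derivative $(h_\theta(x) - y)\, x_j$ into the gradient descent step, and again assuming a linear $h_\theta$, the rule for the $i$-th example reads:

```latex
\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}
```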
Batch Gradient Descent
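Summing the single-example rule over all $m$ training examples, under the same linear-hypothesis assumption, gives the following update, repeated until convergence for every $j$:

```latex
\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}
```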
This method looks at every example in the entire training set on every step, and is called batch gradient descent.
Stochastic Gradient Descent
In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent).
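Both update schemes can be tried on the XOR network with one training loop and a flag. The sketch below is illustrative, not the assignment's solution. It assumes tanh hidden units, bias terms, a squared-error loss and a hypothetical function name `train_xor`. With `stochastic=False` each step uses all 4 points, and with `stochastic=True` each step uses a single point visited in shuffled order.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_xor(n_hidden=2, lr=0.5, epochs=2000, stochastic=False, seed=0):
    """Train a 2 -> n_hidden -> 1 network on XOR with backpropagation.

    Returns the full-dataset loss recorded after every epoch.
    """
    rng = np.random.default_rng(seed)
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([[0.0], [1.0], [1.0], [0.0]])
    W1 = rng.normal(0.0, 1.0, size=(2, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 1.0, size=(n_hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Batch: one step on all points. Stochastic: one step per point.
        if stochastic:
            batches = [[i] for i in rng.permutation(4)]
        else:
            batches = [[0, 1, 2, 3]]
        for idx in batches:
            xb, yb = X[idx], y[idx]
            # Forward pass
            h = np.tanh(xb @ W1 + b1)
            p = sigmoid(h @ W2 + b2)
            # Backward pass for loss = 0.5 * sum((p - y)^2)
            dz2 = (p - yb) * p * (1.0 - p)      # through the sigmoid
            dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
            dz1 = (dz2 @ W2.T) * (1.0 - h**2)   # through the tanh
            dW1, db1 = xb.T @ dz1, dz1.sum(axis=0)
            # Gradient descent step
            W1 -= lr * dW1
            b1 -= lr * db1
            W2 -= lr * dW2
            b2 -= lr * db2
        p_all = sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2)
        losses.append(0.5 * float(np.sum((p_all - y) ** 2)))
    return losses
```

Calling `train_xor(stochastic=False)` and `train_xor(stochastic=True)` and plotting the two loss curves gives a direct comparison. With only 2 hidden neurons, some random initializations may stall, which is one reason the hidden width is left variable.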
- Great. You have trained the network with Gradient Descent. But you need to continuously monitor certain quantities during training of a neural network.
- The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass.
- Ratio of weights:updates
- The last quantity you might want to track is the ratio of the update magnitudes to the value magnitudes. Note: this uses the updates, not the raw gradients. You might want to evaluate and track this ratio for every set of parameters independently.
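The ratio can be computed per parameter array right before the update is applied. This is a minimal sketch for plain gradient descent, where the update is `-lr * dW`. The function name `update_ratio` is hypothetical. The CS231n notes cited in the references suggest about 1e-3 as a rough target for this ratio.

```python
import numpy as np

def update_ratio(W, dW, lr):
    """Ratio of update magnitude to weight magnitude for one parameter array."""
    update = -lr * dW                      # the update, not the raw gradient
    update_scale = np.linalg.norm(update.ravel())
    param_scale = np.linalg.norm(W.ravel())
    return update_scale / param_scale

# Example: four weights of 1.0 and four gradients of 1.0 at lr = 0.001
ratio = update_ratio(np.ones(4), np.ones(4), lr=0.001)
```

A ratio far below the target hints that the learning rate may be too low, and a ratio far above it hints that it may be too high.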
Experiments
- What works better: Batch Gradient Descent or Stochastic Gradient Descent?
- Is your model Overfitting? Try adding Regularization to the Loss function.
- Is your model Underfitting? (Training accuracy is low and there is no significant gap between training and validation error) Try adding more layers to the network or playing with the number of neurons in the hidden layers. How does Depth vs Breadth affect the network?
- Decay your learning rate over the course of training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy plateaus.
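Two of the experiments above need only a few lines each. The sketch below shows a step decay schedule that halves the learning rate every fixed number of epochs, and an L2 penalty as one common form of regularization. The text does not name a specific regularizer, so L2 is an assumption, as are the names `step_decay`, `l2_penalty` and `l2_grad` and the default of 10 epochs per drop.

```python
import numpy as np

def step_decay(lr0, epoch, drop_every=10, factor=0.5):
    """Learning rate after `epoch` epochs, multiplied by `factor` every `drop_every` epochs."""
    return lr0 * factor ** (epoch // drop_every)

def l2_penalty(weights, lam):
    """L2 regularization term 0.5 * lam * sum of squared weights, added to the loss."""
    return 0.5 * lam * sum(float(np.sum(W ** 2)) for W in weights)

def l2_grad(W, lam):
    """Gradient of the L2 term with respect to W, added to the data gradient of W."""
    return lam * W
```

For example, `step_decay(0.1, 25)` has passed two drops and returns 0.1 * 0.5 ** 2. Biases are usually left out of the `weights` list passed to `l2_penalty`.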
References
[1] Some slides are taken from the Stanford class CS231n: Convolutional Neural Networks for Visual Recognition.
[2] Some formulas and text are taken from Andrew Ng's course notes for Machine Learning.