It is a homework for "Data Vision" class
Homework 3: Back Propagation Prof. Davi Geiger & TA. Deshana Desai
Due March 27th, 2018
Introduction
Using Backpropagation algorithm to train a two layer XOR problem and a two layer multiplication problem.
Let us define a one hidden layer network with two input units, N hidden layer units and one output unit, with training set of D data samples, using the following notation
1. the input vector for two units as xd = (xd1 , x d
2) or x d
i
; i = 1, 2, d = 1, . . . , D. (For the XOR problem introduced below, 4 data points exist in the dataset - x1 = (1, 0), x2 = (1, 1), x3 = (0, 1), x4 = (0, 0).)
2. the hidden layer with N units as h j
; j = 1, . . . N. Each of these units (or neurons) is connected to all the units of the previous layer with a weight w1
ji
. Here j refers to the unit index of the hidden layer and the i refers to the unit index of the previous layer.
3. the non-linear function ReLu(t) = max(0, t) =
( t t � 0 0 t < 0
.
N is a hyper-parameter we will specify for this homework.
Forward Calculations
The output is then computed as follows
y(xd, w1, w2, b2, b3) = ReLu
N
 j=1
" w
2 j
Relu
b
2 j
+ 2
 i=1
w
1 ji
x
d
i
! + b3
#! (1)
Define the variables
1. o1 j
= b2 j
+ Â2 i=1 w
1 ji
x
d
i
; j = 1, . . . , N.
1
Figure 1: Network architecture.
2. h j
= Relu(o1 j
) ! y = ReLu ⇣
b
3 + ÂN j=1 w
2 j
h
j
⌘ .
3. o2 = b3 + ÂN j=1 w
2 j
h
j
! y = ReLu(o2)
In matrix notation we can write equation 1 as
y(xd, w1, w2, b2, b3) = ReLu(b3 + w2 · h) = Relu(b3 + w2 · Relu(b2 + w1 xd)) (2)
where b3 is a scalar, xd is a vector of length 2, w2, b2, h are vectors of length N, w1 is a matrix of size N ⇥ 2, and v = Relu(u) applied to a vector u, is meant to output a vector v where each entry is the result of applying the Relu function to each entry of the vector u.
Loss Functions
Say we have d = 1, . . . , D labeled examples to be used for training with input and output pairs (xd, td). The loss function we use for training in this homework is
2
El(w 1, w2, b2, b3) =
1 D
D
 d=1
E
d
l(w 1, w2, b2, b3) (3)
where
E
d
l(w 1, w2, b2, b3) = (y(xd, w1, w2, b2, b3) � td)2 + l
⇣ |w1|2 + |w2|2 + |b2|2 + |b3|2
⌘ (4)
where |w1|2 = Â2 i=1 Â
N
j=1(w 1 ji
)2, |w2|2 = ÂN j=1(w
2 j
)2, |b2|2 = ÂN j=1(b
2 j
)2 and l is a hyper- parameter you estimate in this homework.
Say we have v = 1, . . . , V data examples for validation, the validation set is then V
x
= {(xv, tv); v = 1, . . . , V}. In this homework, we also create a set of labeled data for generalization. The gener-
alization set (also said to be the test data) is then G x
= {(xg, tg); g = 1, . . . , G} In both cases, the validation loss and generalization loss is simply
E
V
x
(w1, w2, b2, b3) = 1 V
V
 v=1
(y(xv, w1, w2, b2, b3) � tv)2 (5)
E
G
x
(w1, w2, b2, b3) = 1 G
V,G
 g=1
(y(xg, w1, w2, b2, b3) � tg)2 (6)
Gradient Descent
The general formula for updating the weights and bias is via gradient descent, where the steps are indexed by t = 1, . . . , T, where T is the stopping criteria, another hyper- parameter to be estimated in this homework.
w
1,2(t + 1) = w1,2(t) � h D
 d=1
∂Ed
∂w1,2 (t) b2,3(t + 1) = b2,3(t) � h
D
 d=1
∂Ed
∂b2,3 (t) (7)
where h, the speed of descent/learning is a hyper-parameter to be estimated.
Computing these gradients
In order to compute these formulae more precisely, we will need the derivative of the
Relu(t) function, which can be readily derived ∂Relu(t)∂t =
( 1 t � 0 0 t < 0
3
Top Layer: We have the N weights w2 j
and N bias b2 j
to update as follows
∂Ed
∂w2 j
= 2(y � td) ∂y
∂w2 j
+ 2lw2 j
= 2(y � td) ∂Relu(o2)
∂o2 ∂o2
∂w2 j
+ 2lw2 j
= 2(y � td) ∂Relu(o2)
∂o2 h
j
+ 2lw2 j
= 2lw2 j
+
( 2(y � td)h
j
o
2 � 0 0 o2 < 0
(8)
where we used y = Relu(o2) and o2 = b3 + ÂN j=1 w
2 j
h
j
. Similarly for the bias b3
∂Ed
∂b3 = 2(y � td)
∂yd
∂b3 + 2lb3 = 2(y � td)
∂Relu(o2) ∂o2
∂o2
∂b3 + 2lb3
= 2lb3 +
( 2(y � td) o2 � 0
0 o2 < 0 (9)
Bottom Layer: We have the N ⇥ 2 weights w1 ji
to update as follows
∂Ed
∂w1 ji
= 2(y � td) ∂yd
∂w1 ji
+ 2lw1 ji
(10)
= 2(y � td) ∂Relu(o2)
∂o2 ∂o2
∂h j
∂h j
∂o1 j
∂o1 j
∂w1 ji
+ 2lw1 ji
= 2(y � td) ∂Relu(o2)
∂o2 w
2 j
∂Relu(o1 j
)
∂o1 j
x
d
i
+ 2lw1 ji
= 2lw1 ji
+
( 2(y � td)w2
j
x
d
i
o
1 j
� 0 & o2 � 0 0 else
(11)
where we used y = Relu(o2) and o2 = b3 + ÂN j=1 w
2 j
h
j
, h j
= Relu(o1 j
), and o1 j
= b2 j
+
Â2 i=1 w
1 ji
x
d
i
. Similarly, for the bias at each hidden layer unit b2 j
we have
4
∂Ed
∂b2 j
= 2(y � td) ∂yd
∂b2 j
+ 2lb2 j
= 2(y � td) ∂Relu(o2)
∂o2 ∂o2
∂h j
∂h j
∂o1 j
∂o1 j
∂b2 j
+ 2lb2 j
= 2lb2 j
+ 2(y � td) ∂yd
∂b2 j
+ 2lb2 j
= 2(y � td) ∂Relu(o2)
∂o2 w
2 j
∂Relu(o1 j
)
∂o1 j
= 2lb2 j
+
( 2(y � td)w2
j
o
1 j
� 0 & o2 � 0 0 else
(12)
Algorithm
We have four hyper-parameters to estimate l (to bias the solution to small weights and not fit the data exactly. Remember, the goal is not training perfectly, but generalizing well), h (speed of learning), T (when to stop learning, so validation/generalization is good even if training error is not perfect). Clearly l and T work on the same problem at different aspects of it. N is the last one, the number of hidden layer units. This one we will specify on the homework.
The back propagation algorithm works by first (at step t = 0) initializing all the weights and bias at small random values (with zero mean).
Then we loop for t = 1, . . . T and each step we loop for the training data d = 1, . . . , D and for each training pair (xd, td) we compute the gradient equation 7 for each of the parameters (weights and bias) using (8), (12), (11) accordingly.
Visualization of the Algorithm
Plot a graph of the average loss El(w1, w2, b2) from (3) versus t. On the same graph, superimpose in a different color the plot of the average validation loss (5) versus t (here using data from the validation set).
These plots will help you choose the hyper-parameter T, the stoppage step where the validation loss does not decrease.
Finally, after the algorithm is finished (no more iterations), we also want to know the single value of the average generalization (6), using the generalization training set.
5
Homework: Two Cases
The first case to study is the well known XOR example. For this example, the hidden layer maybe simply made of two units, N = 2, and the input and target output are binary values.
There are only four different possible inputs/outputs scenarios, � x
1 = (0, 0), t1 = 0 � ,
� x
2 = (0, 1), t2 = 1 � , � x
3 = (1, 0), t3 = 1 � , � x
4 = (1, 1), t4 = 0 � . Thus, for the training set,
validation set and generalization set, for each step t we only have these four examples. Show the plots we described in visualization, show the final generalization loss, show
all the weights of the network (list them). How to show the final generalization output ? we should be able to run your forward code, with the final weights, and your general- ization set, and produce your final outputs.
The second example we ask you to run the multiplication function. The inputs are x1, x2 and they are real values. Well, let us restrict to multiplications of integers ? The target output should be t = x1 ⇥ x2 also an integer number. Note that it should work on negative numbers as well, i.e., x1, x2, t 2 Z. We may restrict further to x1, x2, t 2 (�M, M), say M=100. It is easy to generate a large test set, not so large validation set, and an even larger generalization set. Say D = 1, 000, V = 200 and G = 4, 000. Of course, make sure to include cases of multiplication by zero and by 1.
In this case, the hidden layer should be "large" to have a chance to work. Consider N = 20 (I am not sure it will work!). Display the same plots and values as the XOR problem and we should be able to run your forward code, with the final weights, and produce your final outputs.
6