MLPs and Backpropagation
How does an MLP Learn?
We saw how each layer works:
\[ \begin{bmatrix} a_{12}\\ a_{22}\\ a_{32}\\ \end{bmatrix} = sigmoid~\Bigg( \begin{bmatrix} \color{red}{W^2_{11}} & \color{skyblue}{W^2_{21}} & \color{forestgreen}{W^2_{31}}\\ W^2_{12} & W^2_{22} & W^2_{32}\\ W^2_{13} & W^2_{23} & W^2_{33}\\ \end{bmatrix} * \begin{bmatrix} \color{red}{a_{11}}\\ \color{skyblue}{a_{21}}\\ \color{forestgreen}{a_{31}}\\ \end{bmatrix} + \begin{bmatrix} b_{12}\\ b_{22}\\ b_{32}\\ \end{bmatrix} \Bigg) \tag{1}\]
and:
\[ A^l = \sigma\Bigg(W^lA^{l-1} + B^l\Bigg) \tag{2}\]
See how the connections between neurons are marked by weights: each weight multiplies the signal from the previous neuron. The weighted products are added up in the neuron, and the sum is passed to the activation block therein.
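In code, one layer of Equation 2 is just a matrix multiply, a vector add, and an element-wise sigmoid. Here is a minimal sketch in torch; the sizes mirror Equation 1 and the numbers are made up:

```python
import torch

# One layer of Equation 2: A^l = sigmoid(W^l A^{l-1} + B^l).
# Sizes mirror Equation 1 (3 neurons fed by 3 neurons); values are made up.
W = torch.randn(3, 3)        # one row of weights per neuron in this layer
b = torch.randn(3, 1)        # one bias per neuron
a_prev = torch.rand(3, 1)    # activations A^{l-1} arriving from the previous layer

z = W @ a_prev + b           # weighted sum of incoming signals, plus bias
a = torch.sigmoid(z)         # activation block: squashes each sum into (0, 1)
print(a)                     # the three activations A^l of this layer
```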
So learning?
The only controllable variables in a neural network are these weights (and biases)! So learning involves adapting them so that the network can perform a useful function.
What is the Learning Process?
The process of adapting the weights of a neural network can be described in the following steps:
- Training Set: Training is over several known input-output pairs (“training data”)
- Training Epoch: For each input, the signals propagate forward until we have an output
- Error Calculation: Output is compared with desired output, to calculate error
- Backpropagation: Each neuron (and its weights) needs to be told its share of the error! Errors therefore need to be sent backward from the output to the input, unravelling them from layer \(l\) to layer \(l-1\) (like apportioning blame!!).
- Error-to-Cost: How does error at any given neuron relate to the idea of an overall Cost function? Is the Cost function also apportioned in the same way?
- Differentiate: Evaluate the effect of each weight/bias on the overall Cost. (Slope!!)
- Gradient Descent: Adapt the weights/biases with a small step in the opposite direction to the slope.
There.
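Here is a minimal sketch of those steps in torch. The names `net`, `training_set` and `lr` are hypothetical placeholders; the cost is the averaged squared error we define properly below:

```python
import torch

# Hypothetical setup: `net` is some MLP (an nn.Module), `training_set` is a list of
# (input, desired_output) tensor pairs, and `lr` is a small learning rate.
def train_one_epoch(net, training_set, lr=0.1):
    for x, d in training_set:                 # Training Set: known input-output pairs
        a = net(x)                            # Training Epoch: propagate forward to an output
        cost = 0.5 * ((a - d) ** 2).mean()    # Error Calculation + Error-to-Cost
        net.zero_grad()
        cost.backward()                       # Backpropagation + Differentiate: slope of the Cost w.r.t. every weight/bias
        with torch.no_grad():
            for p in net.parameters():
                p -= lr * p.grad              # Gradient Descent: small step against the slope
```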
What is the Output Error?
If \(d(k)\) are the desired outputs of the NN (over an entire training set), and \(a(k)\) are the outputs of the output layer, then we calculate the error at the outputs of the NN as:
\[ e(k) = a(k) - d(k) \tag{3}\]
This error is calculated at each output for each training epoch/sample/batch. (More about the batch-mode in a bit.)
What is the Cost Function?
We define the cost or objective function as the squared error averaged over all neurons:
\[ \begin{align} C(W, b) &= \frac{1}{2n}\sum^{n ~ neurons}_{i=1}e^2(i)\\ \\ &= \frac{1}{2n}\sum^{n~neurons}_{i=1}(a_i - d_i)^2 \end{align} \tag{4}\]
The \(a_i\) are the outputs of the \(n\) output neurons and the \(d_i\) are the corresponding desired outputs, for each of the training samples.
The Cost Function is of course dependent upon the Weights and the biases, and is to be minimized by adapting these. Using the sum of squared errors, along with the linear operations in the NN, (usually) guarantees that the Cost Function has one global minimum.
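As a quick numerical check of Equation 3 and Equation 4, with made-up outputs and desired values for \(n = 3\) output neurons:

```python
import torch

# Made-up toy values for n = 3 output neurons.
a = torch.tensor([0.8, 0.2, 0.6])     # actual outputs a_i
d = torch.tensor([1.0, 0.0, 0.5])     # desired outputs d_i

e = a - d                             # Equation 3: error at each output
C = (e ** 2).sum() / (2 * len(e))     # Equation 4: squared errors, summed, divided by 2n
print(e)                              # approximately tensor([-0.2000,  0.2000,  0.1000])
print(C)                              # approximately tensor(0.0150)
```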
What is Backpropagation of Error?
As we stated earlier, error is calculated at the output. In order to adapt all the weights, we need to send the error proportionately back along the network, towards the input. This proportional error gives us a basis to adapt the individual weights anywhere in the network.
What does “proportional” mean here? Consider the diagrams below:
\[ \begin{align} e_{11} ~~\pmb\sim~~ ~ e_{12} * \frac{W_{11}}{Sum~of~Weights~to~{\color{magenta}{\pmb{h_1}}}}\\ e_{21} ~~\pmb\sim~~ ~ e_{12} * \frac{W_{21}}{Sum~of~Weights~to~{\color{magenta}{\pmb{h_1}}}} \\ e_{31}~~\pmb\sim~~ ~ e_{12} * \frac{W_{31}}{Sum~of~Weights~to~\color{magenta}{\pmb{h_1}}} \\ \end{align} \]
\[ \begin{align} e_{11} ~~\pmb\sim~~ ~ e_{12} * \frac{W_{11}}{\pmb{\color{magenta}{W_{11} + W_{21} + W_{31}}}} \\ e_{21} ~~\pmb\sim~~ ~ e_{12} *\frac{W_{21}}{\pmb{\color{magenta}{W_{11} + W_{21} + W_{31}}}} \\ e_{31} ~~\pmb\sim~~ ~ e_{12} *\frac{W_{31}}{\pmb{\color{magenta}{W_{11} + W_{21} + W_{31}}}} \\ \end{align} \]
Another way of looking at this:
\[ \begin{align} e_{11} =~ e_{12} * \frac{W_{11}}{Sum~of~Weights~to~{\color{orange}{\pmb {h_1}}}}\\ + ~ e_{22} * \frac{W_{12}}{Sum~of~Weights~to~\color{pink}{\pmb{h_2}}} \\ + ~ e_{32} * \frac{W_{13}}{Sum~of~Weights~to~\color{teal}{\pmb{h_3}}} \\ \end{align} \]
\[ \begin{align} e_{11} = ~ e_{12} * \frac{W_{11}}{\pmb{\color{orange}{W_{11} + W_{21} + W_{31}}}}\\ + e_{22} * \frac{W_{12}}{\pmb{\color{pink}{W_{12} + W_{22} + W_{32}}}} \\ + e_{32} * \frac{W_{13}}{\pmb{\color{teal}{W_{13} + W_{23} + W_{33}}}} \\ \end{align} \]
\[ \begin{align} e_{21} = similar~expression!!\\ e_{31} = similar~expression!!\\ \end{align} \]
We have taken one output error, \(e_{12}\) and parcelled it back to the preceding neurons in proportion to their connecting Weights. This makes intuitive sense; we are making those neurons put their money where their mouth is. As Nassim Nicholas Taleb says, people (and neurons!) need to pay for their opinions, especially when things go wrong!
The accumulated error at each neuron in layer \(l-1\) is the weighted sum of back-propagated error contributions from all the layer \(l\) neurons to which it is connected.
So we can compactly write the relationships above as:
\[ \begin{bmatrix} e_{11}\\ e_{21}\\ e_{31}\\ \end{bmatrix} = \Bigg( \begin{bmatrix} \frac{W_{11}}{D_{11}} & \frac{W_{12}}{D_{12}} & \frac{W_{13}}{D_{13}}\\ \frac{W_{21}}{D_{21}} & \frac{W_{22}}{D_{22}} & \frac{W_{23}}{D_{23}}\\ \frac{W_{31}}{D_{31}} & \frac{W_{32}}{D_{32}} & \frac{W_{33}}{D_{33}}\\ \end{bmatrix} * \begin{bmatrix} {e_{12}}\\ {e_{22}}\\ {e_{32}}\\ \end{bmatrix} \Bigg) \]
The denominators make things look complicated! But if we simply ignore them for a moment, we see a very interesting thing:
\[ \begin{bmatrix} e_{11}\\ e_{21}\\ e_{31}\\ \end{bmatrix} \pmb{\sim} \begin{bmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ W_{31} & W_{32} & W_{33} \\ \end{bmatrix} * \begin{bmatrix} {e_{12}}\\ {e_{22}}\\ {e_{32}}\\ \end{bmatrix} \]
This new approximate matrix is the transpose of our original Weight matrix from Equation 1! The rows there have become columns here!!
\[ e^{l-1} ~ \pmb{\sim} ~ {W^l}^{\pmb{\color{red}{T}}}* e^{l} \tag{5}\]
This is our equation for backpropagation of error.
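A small numerical sketch of this, with made-up weights and errors, comparing the exact "apportioned" version (with the denominators) against the plain transpose of Equation 5:

```python
import torch

# Made-up values: the weight matrix of layer l (rows = weights into each layer-l neuron)
# and the errors at layer l's three neurons. Positive weights keep the denominators simple.
W = torch.rand(3, 3) + 0.1
e_l = torch.rand(3, 1)

# Exact apportioning: divide each weight by the sum of weights into its neuron, then transpose.
e_prev_exact = (W / W.sum(dim=1, keepdim=True)).T @ e_l

# Equation 5: ignore the denominators and just use the transpose.
e_prev_approx = W.T @ e_l

print(e_prev_exact.flatten())
print(e_prev_approx.flatten())   # roughly the same pattern of blame, scaled differently
```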
Why is ignoring all those individual denominators justified? Let us park that question until we have understood the last step in NN training: Gradient Descent.
Here Comes the Rain Maths Again!
Now, we are ready (maybe?) to watch these two very beautifully made videos on Backpropagation. One is of course from Dan Shiffman, and the other from Grant Sanderson, a.k.a. 3Blue1Brown.
Backpropagation in Code
Using torch
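One possible minimal example is sketched below: a tiny MLP trained on XOR, with torch's autograd doing the backpropagation for us. The architecture, learning rate, and epoch count are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

# XOR training data: four input pairs and their desired outputs.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = torch.tensor([[0.], [1.], [1.], [0.]])

# Tiny MLP: 2 inputs -> 3 hidden sigmoid neurons -> 1 sigmoid output.
net = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid(), nn.Linear(3, 1), nn.Sigmoid())
cost_fn = nn.MSELoss()                                  # averaged squared error (without the 1/2 factor)
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)

for epoch in range(5000):
    A = net(X)                       # forward pass: propagate signals through every layer
    cost = cost_fn(A, D)             # compare outputs with desired outputs
    optimizer.zero_grad()
    cost.backward()                  # backpropagation: gradient of the Cost w.r.t. every weight/bias
    optimizer.step()                 # gradient descent: small step against the slope

print(net(X).detach().round())       # should (usually) approach [[0.], [1.], [1.], [0.]]
```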