Gradient Descent
Learning: Adapting the Weights
We obtained the backpropagated error for each layer:
\[ \begin{bmatrix} e_{11}\\ e_{21}\\ e_{31}\\ \end{bmatrix} \pmb{\sim} \begin{bmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ W_{31} & W_{32} & W_{33} \\ \end{bmatrix} * \begin{bmatrix} {e_{12}}\\ {e_{22}}\\ {e_{32}}\\ \end{bmatrix} \]
And the matrix form:
\[ e^{l-1} ~ \pmb{\sim} ~ {W^l}^{\pmb{\color{red}{T}}}* e^{l} \tag{1}\]
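To make Equation 1 concrete, here is a minimal torch sketch with made-up numbers for a pair of 3-neuron layers. (The storage convention below, one row per receiving neuron, is an assumption made just for this sketch, so that the errors flow backwards through `W.T`.)

```python
import torch

# Made-up 3x3 weight matrix: row k holds the weights feeding neuron k of layer l
# from the three neurons of layer l-1.
W = torch.tensor([[0.9, 0.3, 0.4],
                  [0.2, 0.8, 0.2],
                  [0.1, 0.5, 0.6]])

e_l = torch.tensor([0.8, 0.5, 0.3])   # errors already known at layer l

# Equation 1: e^(l-1) ~ (W^l)^T * e^l
e_prev = W.T @ e_l                    # backpropagated errors for layer l-1
print(e_prev)                         # tensor([0.8500, 0.7900, 0.6000])
```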
Now what? How do we use all these errors, from those at the output layer all the way to those backpropagated back to the first (\(l=1\)) layer? To adapt the weights of the NN using these backpropagated errors, here are the steps:
- Per-Weight Cost Gradient: We are looking for something like \(\large{\pmb{\color{red}{\frac{dC}{dW_{jk}}}}}\) for all possible combos of \(jk\).
- Learn: Adapt each Weight in the opposite direction to its Cost-Gradient. (Why?)
Are you ready? ;-D Let us do this!
Cost-Gradient for each Weight
- The cost function was the squared error averaged over all \(n\) neurons:
\[ \begin{align} C(W, b) &= \frac{1}{2n}\sum^{n ~ neurons}_{i=1}e_i^2\\ \end{align} \tag{2}\]
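As a tiny sketch of Equation 2 in torch (the activations and desired outputs below are made-up numbers):

```python
import torch

a = torch.tensor([0.7, 0.2, 0.9])    # actual activations (made up)
d = torch.tensor([1.0, 0.0, 1.0])    # desired outputs (made up)

n = a.numel()                        # number of neurons
e = a - d                            # per-neuron error, as used in the derivation below
C = (e ** 2).sum() / (2 * n)         # Equation 2
print(C)                             # tensor(0.0233)
```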
- Serious Magic: We want to differentiate this sum with respect to each Weight. Before we calculate \(\frac{dC}{dW^l_{jk}}\), we realize that any weight \(W^l_{jk}\) feeds as input into only one neuron \(k\), which outputs \(a_k\). No other neuron-term in the above summation depends upon this specific Weight, so the summation collapses to just one term, the one pertaining to the activation-output \(a^l_k\)!
\[ \begin{align} \frac{d~C}{d~\color{orange}{\pmb{W^l_{jk}}}} &= \large\frac{d}{d~\color{orange}{\pmb{W^l_{jk}}}}\Bigg({\frac{1}{2n}\sum^{all~n~neurons}_{i=1}e_i^2}~\Bigg)\\ \\ &= \frac{\color{skyblue}{\large{e^l_k}} }{n} ~ * \frac{d}{d~\color{orange}{\pmb{W^l_{jk}}}} ~ \Big(\pmb{\color{red}{\Large{{e^{l}_k}}}} ~ \Big) \quad \text{only the } k^{th} \text{ neuron of layer } l \text{ survives}\\ \\ &= \frac{\color{skyblue}{\large{e^l_k}} }{n} ~ * ~ \frac{d}{d~\color{orange}{\pmb{W^l_{jk}}}}\Big(\Large{\color{red}{a^{l}_k - d^l_k}}\Big) \end{align} \]
- Now, the relationship between \(a^{l}_k\) and \(W^l_{jk}\) involves the sigmoid function. (And the desired output \(d^l_k\) is a constant: it does not depend on any weight!)
\[ \begin{align} \color{red}{\pmb{a^l_k}} ~ &= \sigma~\bigg(\sum^{neurons~in~l-1}_{j=1} \pmb{\color{orange}{W^l_{jk}}} ~ * ~ a^{l-1}_j ~ + ~ b^l_k\bigg)\\ &= \color{red}{\sigma(everything)}\\ \end{align} \]
We also know \[ \large{\frac{d\sigma(x)}{dx}} = \sigma(x) * \big(1 - \sigma(x)\big) \]
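If you want to convince yourself of this identity numerically, here is a throwaway check (the sample points are arbitrary) that compares torch's autograd derivative of \(\sigma(x)\) with the closed form:

```python
import torch

x = torch.linspace(-3.0, 3.0, 7, requires_grad=True)
s = torch.sigmoid(x)

# Let autograd compute d(sigma)/dx ...
s.sum().backward()

# ... and compare it with sigma(x) * (1 - sigma(x))
print(torch.allclose(x.grad, (s * (1 - s)).detach()))   # True
```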
Final Leap: Using the great chain rule for differentiation, we obtain:
\[ \begin{align} \frac{d~C}{d~\color{orange}{\pmb{W^l_{jk}}}} &= \frac{\color{skyblue}{\large{e^l_k}} }{n} ~ * ~ \frac{d}{d~\color{orange}{\pmb{W^l_{jk}}}}\Big(\Large{\color{red}{a^{l}_k - d^l_k}}\Big)\\ &= \frac{\color{skyblue}{\large{e^l_k}} }{n} ~ * ~\frac{d~\color{red}{\pmb{a^l_k}}}{d~\color{orange}{\pmb{W^l_{jk}}}}\\ &= \frac{\color{skyblue}{\large{e^l_k}} }{n} ~ *\frac{d~ \color{red}{\sigma(everything)}}{d~\color{orange}{\pmb{W^l_{jk}}}}\\ \\ &= \frac{\color{skyblue}{\large{e^l_k}} }{n} ~ * \sigma(everything) * \big(1 -\sigma(everything)\big) * \frac{d(everything)}{d~\color{orange}{\pmb{W^l_{jk}}}}~~ \text{Applying the Chain Rule!}\\ \\ &= \frac{\color{skyblue}{\large{e^l_k}} }{n} ~ * ~ \color{red}{a^{l-1}_j} ~ * ~ \sigma(everything) * \big(1 -\sigma(everything)\big) \qquad \text{since } \frac{d(everything)}{d~\color{orange}{\pmb{W^l_{jk}}}} = a^{l-1}_j\\ \\ &= \frac{\color{skyblue}{\large{e^l_k}} }{n} ~ * ~ \color{red}{a^{l-1}_j} ~ * ~ \sigma~\bigg(\sum^{neurons~in~l-1}_{j=1} \pmb{\color{orange}{W^l_{jk}}} ~ * ~ a^{l-1}_j ~ + ~ b^l_k\bigg) ~ * ~ \bigg(1 - \sigma~\bigg(\sum^{neurons~in~l-1}_{j=1} \pmb{\color{orange}{W^l_{jk}}} ~ * ~ a^{l-1}_j ~ + ~ b^l_k\bigg)\bigg) \end{align} \tag{3}\]
How to understand this monster equation intuitively? Let us first draw a diagram to visualize the components:
Let us take the Weight \(W^l_{jk}\). It connects neuron \(j\) in layer \(l-1\) with neuron \(k\) in layer \(l\), via the activation \(a^{l-1}_j\). The relevant output error (the one that contributes to the Cost function) is \(e^l_{k}\).
- The product \(\large{\color{red}{a^{l-1}_j} ~ * ~ \color{lightblue}{e^l_k}}\) is like a correlation product of the two quantities at the input and output of neuron \(k\). This product contributes a sense of slope: the larger either of these, the larger the Cost-slope going from neuron \(j\) to neuron \(k\).
- How do we account for the magnitude of the Weight \(W^l_{jk}\) itself? Surely that matters! Yes, but note that \(W^l_{jk}\) is entwined with the remaining inputs and weights via the \(\sigma\) function term! We must differentiate that and put that differential into the product! That gives us the two other product terms in the formula above which involve the sigmoid function.
So, monster as it is, the formula is quite intuitive and even beautiful!
What does this Gradient Look Like?
This gradient is calculated (in vectorized fashion) for all the weights in the network: the per-weight gradients of a layer are collected into a matrix of the same shape as \(W^l\), so a whole layer can be handled at once, as in the sketch below.
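Here is one way that vectorized calculation could look in torch. It is only a sketch: the layer sizes and values are made up, and the matrix is again stored with one row per receiving neuron. Equation 3 for every weight of the layer then becomes an outer product of the scaled errors with the previous layer's activations:

```python
import torch

torch.manual_seed(0)

# Made-up layer: 4 neurons in layer l-1 feeding 3 neurons in layer l.
a_prev = torch.rand(4)               # a^(l-1), activations of layer l-1
W      = torch.randn(3, 4)           # row k = weights into neuron k of layer l
b      = torch.randn(3)              # biases of layer l
e_l    = torch.rand(3)               # backpropagated errors e^l at layer l
n      = e_l.numel()

z   = W @ a_prev + b                 # the "everything" inside the sigmoid
a_l = torch.sigmoid(z)

# Equation 3 for all weights at once:
# dC/dW[k, j] = (e^l_k / n) * sigma'(z_k) * a^(l-1)_j
dC_dW = torch.outer(e_l / n * a_l * (1 - a_l), a_prev)
print(dC_dW.shape)                   # torch.Size([3, 4])
```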
How Does the NN Use this Gradient?
So now that we have the gradient of Cost vs \(W^l_{jk}\), we can adapt \(W^l_{jk}\) by moving a small tuning step in the opposite direction:
\[ W^l_{jk}~\Big|_{new} = W^l_{jk}~\Big|_{old} ~ - ~ \alpha * \frac{d~C}{d~W^l_{jk}} \tag{4}\]
and we adapt all weights in the opposite direction to their individual cost gradients. The parameter \(\alpha\) is called the learning rate.
Yes, but not all neurons have a desired output; so what do we use for error?? Only the output neurons have a desired output!!
The backpropagated error, peasants! Each neuron has already “received” its share of the error. That share is converted into a Cost, whose gradient with respect to all the input weights of that specific neuron is calculated using Equation 3, and each weight is then adapted using Equation 4, as in the short sketch below.
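As a tiny sketch of Equation 4 itself (the weights, gradients and learning rate below are made-up numbers), the update is one subtraction applied to every weight at once:

```python
import torch

alpha = 0.1                                   # learning rate (illustrative value)
W_old = torch.tensor([[0.9, 0.3],
                      [0.2, 0.8]])            # made-up weights of some layer
dC_dW = torch.tensor([[0.05, -0.02],
                      [0.01,  0.03]])         # made-up gradients from Equation 3

W_new = W_old - alpha * dC_dW                 # Equation 4
print(W_new)   # tensor([[0.8950, 0.3020], [0.1990, 0.7970]])
```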
Here Comes the Rain Maths Again!
Now, we are ready (maybe?) to watch these two very beautifully made videos on Backpropagation. One is of course from Dan Shiffman, and the other from Grant Sanderson a.k.a. 3Blue1Brown.
Gradient Descent in Code
Using torch
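Here is a minimal, self-contained sketch of the whole learning loop in torch. Everything about it is an illustrative choice rather than a prescription: the XOR dataset, the 2-3-1 architecture, the learning rate and the number of steps are all made up, and autograd is used to produce the per-weight gradients instead of hand-coding Equation 3. The update step itself is exactly Equation 4.

```python
import torch

torch.manual_seed(42)

# Tiny made-up dataset: XOR, 2 inputs -> 1 desired output
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = torch.tensor([[0.], [1.], [1.], [0.]])

# A 2-3-1 network with sigmoid activations, weights and biases as plain tensors
W1 = torch.randn(3, 2, requires_grad=True)
b1 = torch.randn(3, requires_grad=True)
W2 = torch.randn(1, 3, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)

alpha = 0.5                                   # learning rate (illustrative)

for step in range(5000):
    # Forward pass: a^l = sigma(W^l a^(l-1) + b^l), layer by layer
    a1 = torch.sigmoid(X @ W1.T + b1)
    a2 = torch.sigmoid(a1 @ W2.T + b2)

    # Cost as in Equation 2: squared error, averaged
    e = a2 - D
    C = (e ** 2).mean() / 2

    # Backward pass: autograd backpropagates and computes dC/dW for every weight
    C.backward()

    # Gradient descent, Equation 4: move each weight against its own gradient
    with torch.no_grad():
        for p in (W1, b1, W2, b2):
            p -= alpha * p.grad
            p.grad.zero_()

    if step % 1000 == 0:
        print(f"step {step:4d}   cost {C.item():.4f}")

print(a2.detach().round())   # hopefully close to the XOR targets
```

The same loop could also be written with hand-coded backpropagation in the spirit of Equations 1 and 3; letting autograd apply the chain rule just keeps the sketch short.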