Gradient Descent

Published: November 23, 2024
Modified: May 25, 2025

Learning: Adapting the Weights

We obtained the backpropagated error for each layer:

$$
\begin{bmatrix} e_1^1 \\ e_2^1 \\ e_3^1 \end{bmatrix} \sim
\begin{bmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ W_{31} & W_{32} & W_{33} \end{bmatrix} *
\begin{bmatrix} e_1^2 \\ e_2^2 \\ e_3^2 \end{bmatrix}
$$

And the matrix form:

$$
e^{l-1} \sim \left(W^l\right)^T * e^l \tag{1}
$$
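Equation 1 is easy to try out in code. Here is a minimal base-R sketch, with invented numbers for a pair of 3-neuron layers, showing that backpropagating the error is nothing more than a matrix-vector product:

```r
# Backpropagating errors, per Equation 1: e^(l-1) ~ t(W^l) %*% e^l
# (all numbers below are invented, purely for illustration)
W <- matrix(c(0.9, 0.3, 0.4,
              0.2, 0.8, 0.2,
              0.1, 0.5, 0.6),
            nrow = 3, byrow = TRUE)   # weights W_jk of layer l
e_l <- c(0.8, 0.5, 0.3)               # errors at the outputs of layer l

e_prev <- t(W) %*% e_l                # each neuron in layer l-1 gets its share of the error
e_prev
```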

Now what? How do we use all these errors, from those at the output layer all the way back to those backpropagated to the first (l=1) layer? To adapt the weights of the NN using these backpropagated errors, here are the steps:

  1. Per-Weight Cost Gradient: We are looking for something like $\frac{dC}{dW_{jk}}$ for all possible combinations of $j$ and $k$.
  2. Learn: Adapt each Weight by a small step in the direction opposite to its Cost-Gradient. (Why?)

Are you ready? ;-D Let us do this!

Cost-Gradient for each Weight

  1. The cost function was the squared error averaged over all n neurons:

$$
C(W, b) = \frac{1}{2n}\sum_{i=1}^{n~\text{neurons}} e^2(i) \tag{2}
$$

  2. Serious Magic: We want to differentiate this sum with respect to each Weight. Before we calculate $\frac{dC}{dW_{jk}^l}$, we realize that any weight $W_{jk}^l$ connects only as input to one neuron $k$, which outputs $a_k$. No other neuron-term in the above summation depends upon this specific Weight, so the summation collapses to just one term, the one pertaining to the activation-output $a_k$!

$$
\begin{aligned}
\frac{dC}{dW_{jk}^l} &= \frac{d}{dW_{jk}^l}\left(\frac{1}{2n}\sum_{i=1}^{\text{all } n \text{ neurons}} (e_i)^2\right)\\
&= \frac{e_k^l}{n} * \frac{d}{dW_{jk}^l}\left(e_k^l\right) \qquad \text{only the } k^{th} \text{ neuron in the } l^{th} \text{ layer}\\
&= \frac{e_k^l}{n} * \frac{d}{dW_{jk}^l}\left(a_k^l - d_k^l\right)
\end{aligned}
$$

  3. Now, the relationship between $a_k^l$ and $W_{jk}^l$ involves the sigmoid function. (And the desired output $d_k^l$ does not depend upon the weights at all!)

$$
a_k^l = \sigma\left(\sum_{j=1}^{\text{neurons in } l-1} W_{jk}^l * a_j^{l-1} + b_j^l\right) = \sigma(\text{everything})
$$

  4. We also know that $\frac{d\sigma(x)}{dx} = \sigma(x) * \left(1 - \sigma(x)\right)$

  5. Final Leap: Using the great chain rule for differentiation, we obtain:

$$
\begin{aligned}
\frac{dC}{dW_{jk}^l} &= \frac{e_k^l}{n} * \frac{d}{dW_{jk}^l}\left(a_k^l - d_k^l\right)\\
&= \frac{e_k^l}{n} * \frac{d\,a_k^l}{dW_{jk}^l}\\
&= \frac{e_k^l}{n} * \frac{d\,\sigma(\text{everything})}{dW_{jk}^l}\\
&= \frac{e_k^l}{n} * \sigma(\text{everything}) * \bigl(1 - \sigma(\text{everything})\bigr) * \frac{d(\text{everything})}{dW_{jk}^l} \qquad \text{Applying the Chain Rule!}\\
&= \frac{e_k^l}{n} * a_j^{l-1} * \sigma\left(\sum_{j=1}^{\text{neurons in } l-1} W_{jk}^l * a_j^{l-1} + b_j^l\right) * \left(1 - \sigma\left(\sum_{j=1}^{\text{neurons in } l-1} W_{jk}^l * a_j^{l-1} + b_j^l\right)\right)\\
&= \frac{e_k^l}{n} * a_j^{l-1} * a_k^l * \left[1 - a_k^l\right]
\end{aligned}
\tag{3}
$$

Equation corrected by Adit Joshi and Ananya Krishnan, April 2025
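Equation 3 is also easy to verify numerically. Here is a small base-R sketch (every number in it is invented, purely for illustration): it computes the gradient for one weight $W_{jk}$ with the formula, and compares it against a brute-force finite-difference estimate of the same derivative.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# A tiny layer: 3 neurons in layer l-1 feeding 2 neurons in layer l.
a_prev <- c(0.4, 0.9, 0.2)                         # activations a_j^(l-1)
W <- matrix(c( 0.3, -0.6,
               0.8,  0.1,
              -0.2,  0.5), nrow = 3, byrow = TRUE) # W[j, k] connects neuron j to neuron k
b <- c(0.05, -0.10)                                # biases of layer l
d <- c(1, 0)                                       # desired outputs d_k^l
n <- length(d)                                     # number of neurons in layer l

cost <- function(W) {
  a <- sigmoid(as.vector(t(W) %*% a_prev) + b)     # a_k^l = sigma(sum_j W_jk * a_j + b)
  sum((a - d)^2) / (2 * n)                         # Equation 2
}

# Gradient for a single weight W_jk, via Equation 3
j <- 2; k <- 1
a <- sigmoid(as.vector(t(W) %*% a_prev) + b)
e_k <- a[k] - d[k]
grad_formula <- (e_k / n) * a_prev[j] * a[k] * (1 - a[k])

# Brute-force check: central finite difference on the cost
h <- 1e-6
W_plus  <- W; W_plus[j, k]  <- W[j, k] + h
W_minus <- W; W_minus[j, k] <- W[j, k] - h
grad_numeric <- (cost(W_plus) - cost(W_minus)) / (2 * h)

c(formula = grad_formula, numeric = grad_numeric)  # the two should agree closely
```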

How to understand this monster equation intuitively? Let us first draw a diagram to visualize the components:

Let us take the Weight $W_{jk}$. It connects neuron $j$ in layer $l-1$ with neuron $k$ in layer $l$, using the activation $a_j^{l-1}$. The relevant output error (the one that contributes to the Cost function) is $e_k^l$.

  • The product $a_j^{l-1} * e_k^l$ is like a correlation product of the two quantities at the input and output of neuron $k$. This product contributes to a sense of slope: the larger either of these is, the larger the Cost-slope going from neuron $j$ to $k$.
  • How do we account for the magnitude of the Weight $W_{jk}$ itself? Surely that matters! Yes, but note that $W_{jk}$ is entwined with the remaining inputs and weights via the $\sigma$ function term! We must differentiate that and put that differential into the product! That gives us the two other product terms in the formula above which involve the sigmoid function.

So, monster as it is, the formula is quite intuitive and even beautiful!

What does this Gradient Look Like?

This gradient is calculated (in vector fashion) for all weights.
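Continuing the little sketch above (same invented a_prev, W, b, d and n): because Equation 3 factors into an input term $a_j^{l-1}$ and an output term $e_k^l * a_k^l * (1 - a_k^l)/n$, the gradient for every weight of a layer can be assembled at once with an outer product.

```r
# The whole gradient matrix (one entry per weight W_jk) in one shot:
a     <- sigmoid(as.vector(t(W) %*% a_prev) + b)
delta <- (a - d) / n * a * (1 - a)   # the per-neuron factor e_k/n * a_k * (1 - a_k)
gradW <- outer(a_prev, delta)        # gradW[j, k] = a_j^(l-1) * delta_k; same shape as W
gradW
```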

How Does the NN Use this Gradient?

So now that we have the gradient of Cost with respect to $W_{jk}^l$, we can adapt $W_{jk}^l$ by moving it a small tuning step in the opposite direction:

$$
W_{jk}^l \,\Big|_{new} = W_{jk}^l \,\Big|_{old} - \alpha * \text{gradient} \tag{4}
$$

and we adapt all weights in opposition to their individual cost gradients. The parameter α is called the learning rate.
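In code, Equation 4 is a one-liner for the whole weight matrix. Continuing the sketch above, with an arbitrarily chosen learning rate:

```r
alpha <- 0.5                  # learning rate (an arbitrary, illustrative choice)
W_new <- W - alpha * gradW    # Equation 4, applied to every weight of the layer at once

c(before = cost(W), after = cost(W_new))   # one step should already lower the cost a little
```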

Yes, but not all neurons have a desired output; so what do we use for error?? Only the output neurons have a desired output!!

The backpropagated error, peasants! Each neuron has already “received” its share of error, which is converted to Cost, whose gradient with respect to each input weight of that specific neuron is calculated using Equation 3; each weight is then adapted using Equation 4.

Here Comes the Rain Maths Again!

Now, we are ready (maybe?) to watch these two very beautifully made videos on Backpropagation. One is of course from Dan Shiffman, and the other from Grant Sanderson a.k.a. 3Blue1Brown.

Gradient Descent in Code

  • Using p5.js
  • Using R, with torch
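Alongside the p5.js and torch versions, here is a stand-alone sketch in base R: a complete, if tiny, gradient-descent loop that strings Equations 2 to 4 together for a single sigmoid layer. Every number in it, from the inputs to the learning rate to the iteration count, is invented purely for illustration.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# A toy "network": one sigmoid layer mapping 3 inputs to 2 outputs.
# All values are invented; the point is the shape of the loop, not the data.
set.seed(42)
a_prev <- c(0.4, 0.9, 0.2)                    # fixed input activations a^(l-1)
d      <- c(1, 0)                             # desired outputs d^l
W      <- matrix(runif(6, -1, 1), nrow = 3)   # weights W[j, k], randomly initialised
b      <- rep(0, 2)                           # biases
n      <- length(d)
alpha  <- 2                                   # learning rate (illustrative)

for (step in 1:200) {
  # Forward pass: a_k = sigma(sum_j W_jk * a_j + b_k)
  a <- sigmoid(as.vector(t(W) %*% a_prev) + b)

  # Error and cost (Equation 2)
  e <- a - d
  C <- sum(e^2) / (2 * n)

  # Cost gradient for every weight and bias (Equation 3, vectorised as an outer product)
  delta <- e / n * a * (1 - a)
  gradW <- outer(a_prev, delta)
  gradb <- delta

  # Learning step (Equation 4)
  W <- W - alpha * gradW
  b <- b - alpha * gradb

  if (step %% 50 == 0) cat("step", step, ": cost =", round(C, 5), "\n")
}
```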

References

  1. Tariq Rashid. Make Your Own Neural Network. PDF Online.
  2. MathOverflow. Intuitive Crutches for Higher Dimensional Thinking. https://mathoverflow.net/questions/25983/intuitive-crutches-for-higher-dimensional-thinking
  3. Interactive Backpropagation Explainer. https://xnought.github.io/backprop-explainer/