Backpropagation: The Relay Race That Taught Machines to Learn
Tamaso mā jyotirgamaya: lead me from darkness to light, from avidyā (ignorance) to jñāna (knowledge).
Let’s face it: deep learning sounds like wizardry. Massive neural networks, trained on mountains of data, doing things that look suspiciously like magic. But behind all the hype and jargon lies an algorithm so elegant — and so grounded in common sense — it’s almost anticlimactic.
That algorithm is backpropagation.
Today, we’ll unpack what backprop really is, where it came from, and how it works, in plain English with enough math to keep things honest. The metaphor we’ll use: a relay race.
If You Have No Idea What Deep Learning Is
Imagine you have a pile of labeled examples:
- Input: a photo of a cat wearing sunglasses → Output: Yes
- Input: a photo of a dog → Output: No
- Input: handwritten squiggles → Output: the digit 5
Deep learning builds a system — a model — that learns the mapping from input to output automatically, without hand-coded rules. Show it enough examples, and it figures out the rules itself. That’s why it’s powerful.
But none of this works without a way to learn from mistakes. To see that it guessed wrong, and adjust itself to be more right next time.
That’s exactly what backpropagation is.
What Is Backpropagation?
Backpropagation (short for backward propagation of errors) is the algorithm that makes neural networks learn. It figures out which part of the decision-making process messed up — and by how much.
Imagine a relay team that keeps finishing last. After each race, the coach doesn’t just yell “run faster.” Instead, he reviews the replay and tells each runner exactly what they could do better: start quicker, pass the baton smoother, sprint harder at the finish.
Backpropagation is that replay analysis and targeted coaching.
A Brief History (The Ancient Scroll)
The idea of training neural networks this way is usually traced to Paul Werbos’s 1974 Harvard dissertation. For years, it sat largely unnoticed, like inventing rock and roll before anyone had built a guitar.
It was rediscovered and popularized by Rumelhart, Hinton, and Williams in 1986:
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
Once rediscovered, it transformed neural networks from academic curiosities into tools that could learn complex patterns from data.
What Problem Did It Solve?
Before backprop, training multilayer networks was like improving a relay team by only coaching the last runner.
You could tweak the output layer. But there was no systematic way to improve the hidden layers — the intermediate runners who do most of the real work — so they often remained random and unhelpful.
Backpropagation solved this by showing exactly how errors flowed backward through every connection in the network. Now every runner could learn together.
How It Works: The Non-Math Way
Here’s the big picture:
- Forward Pass: Your runners complete the race. The network makes a prediction.
- Compute Error: You look at the finish time. How far off were you?
- Backward Pass: You watch the replay and figure out who contributed most to the slowdown.
- Gradients: You give each runner personalized advice, with adjustments proportional to their contribution.
- Update: Each runner tweaks their approach slightly for the next race.
- Iterate: Repeat thousands of times. The team steadily improves.
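To make the loop concrete before the math arrives, here is a minimal sketch with a "team" of exactly one runner: a single weight `w` in the model `y = w * x`. The data, the learning rate, and the number of epochs are all made up for illustration.

```python
# Toy "team" with a single runner: predict y = w * x.
# The data is invented for illustration; the true relationship is y = 3x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0               # initial guess for the weight
learning_rate = 0.05  # step size, picked by hand for this toy problem

for epoch in range(200):             # Iterate
    for x, y in data:
        y_hat = w * x                # Forward pass: make a prediction
        error = y_hat - y            # Compute error: how far off were we?
        grad_w = error * x           # Backward pass: gradient of 0.5 * error**2 w.r.t. w
        w -= learning_rate * grad_w  # Update: nudge the weight

print(round(w, 4))
```

With one weight there is nobody to share the blame with, so the backward pass is a single multiplication. The rest of the article is about doing the same thing when there are many runners.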
How It Works: The Math
Now let’s make the metaphor precise.
Step 1: The Network and the Forward Pass
Consider a three-layer network: input layer (green / Runner 1), hidden layer (pink / Runner 2), output layer (blue / Runner 3).
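Each layer $l$ performs two operations, a linear transformation followed by a nonlinear activation:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma\left(z^{(l)}\right)$$
Where:
- $W^{(l)}$: weight matrix for layer $l$ (how hard each runner pushes)
- $b^{(l)}$: bias vector (starting advantage or disadvantage)
- $\sigma$: activation function, e.g. ReLU: $\sigma(z) = \max(0, z)$, or sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- $a^{(l)}$: activations (the baton being passed forward), with $a^{(0)} = x$ being the input
For a full forward pass through all layers, with $L$ denoting the output layer:
$$\hat{y} = a^{(L)} = \sigma\left(W^{(L)} \, \sigma\left(\cdots \sigma\left(W^{(1)} x + b^{(1)}\right) \cdots\right) + b^{(L)}\right)$$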
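Here is a minimal sketch of that forward pass in plain Python, with lists standing in for vectors and matrices. The layer sizes and the weights `W1`, `b1`, `W2`, `b2` are hypothetical values chosen so the arithmetic is easy to follow by hand. Using ReLU in the hidden layer and sigmoid at the output is likewise just one possible choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def layer_forward(W, b, a_prev, activation):
    """Compute z = W a_prev + b, then a = activation(z), element by element."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    a = [activation(z_i) for z_i in z]
    return z, a

# Hypothetical weights for a network with 2 inputs, 2 hidden units, 1 output.
W1 = [[0.5, -0.5], [0.25, 0.75]]
b1 = [0.0, 0.5]
W2 = [[1.0, -1.0]]
b2 = [0.0]

x = [1.0, 2.0]                               # a^(0): the input
z1, a1 = layer_forward(W1, b1, x, relu)      # hidden layer
z2, a2 = layer_forward(W2, b2, a1, sigmoid)  # output layer
y_hat = a2[0]

print(z1, a1, y_hat)
```

The function returns `z` as well as `a` because the backward pass needs the pre-activation values to evaluate the derivative of the activation.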
Step 2: Compute the Error (Loss)
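Once you have the prediction $\hat{y}$, you measure how wrong it is. For regression, the standard choice is Mean Squared Error, written here for a single example:
$$\mathcal{L} = \frac{1}{2} \sum_i \left(\hat{y}_i - y_i\right)^2$$
The $\frac{1}{2}$ is purely cosmetic: it cancels the 2 that appears when you differentiate, leaving $\partial \mathcal{L} / \partial \hat{y}_i = \hat{y}_i - y_i$. For classification, you’d use cross-entropy loss:
$$\mathcal{L} = -\sum_i y_i \log \hat{y}_i$$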
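Both losses fit in a few lines. This is a sketch for a single example. The function names are made up for illustration, and the `eps` clamp inside the logarithm is an added safety measure against `log(0)`, not part of the textbook formula.

```python
import math

def half_squared_error(y_hat, y):
    """0.5 * sum of squared differences between prediction and target."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))

def half_squared_error_grad(y_hat, y):
    """Gradient w.r.t. the prediction: the 1/2 cancels the 2 from the square."""
    return [p - t for p, t in zip(y_hat, y)]

def cross_entropy(y_hat, y, eps=1e-12):
    """Cross-entropy between a one-hot target y and predicted probabilities y_hat."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(y_hat, y))

mse = half_squared_error([2.0, 0.0], [1.0, 1.0])
mse_grad = half_squared_error_grad([2.0, 0.0], [1.0, 1.0])
ce = cross_entropy([0.7, 0.2, 0.1], [1.0, 0.0, 0.0])

print(mse, mse_grad, ce)
```

With a one-hot target, cross-entropy reduces to minus the log of the probability assigned to the correct class, which is why confident wrong answers are punished so heavily.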
Step 3: The Backward Pass — The Chain Rule
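Here’s the core insight. The loss $\mathcal{L}$ depends on $\hat{y} = a^{(L)}$, which depends on $z^{(L)}$, which depends on $W^{(L)}$ and $a^{(L-1)}$, which depends on the layer before it… and so on, all the way back to the input.
To know how much a weight $W^{(l)}$ contributed to the error, we apply the chain rule of calculus:
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial W^{(l)}}$$
We define the error signal at each layer, which measures how much that layer contributed to the final loss:
$$\delta^{(l)} = \frac{\partial \mathcal{L}}{\partial z^{(l)}}$$
For the output layer:
$$\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'\left(z^{(L)}\right) = \left(\hat{y} - y\right) \odot \sigma'\left(z^{(L)}\right)$$
where the last equality holds for Mean Squared Error. For a hidden layer, error propagates backward through the weights of the next layer:
$$\delta^{(l)} = \left( \left(W^{(l+1)}\right)^{\top} \delta^{(l+1)} \right) \odot \sigma'\left(z^{(l)}\right)$$
Where $\odot$ is element-wise multiplication and $\sigma'$ is the derivative of the activation function.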
This is the relay race in reverse — blame passing backward from the last runner to the first.
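The two error-signal rules can be sketched as a pair of small functions. This assumes sigmoid activations in every layer and the half squared-error loss. The function names and the example numbers are hypothetical, chosen so that every intermediate value is easy to check by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def output_delta(a_out, y, z_out):
    """Error signal at the output layer: (a - y) * sigma'(z), element-wise."""
    return [(a - t) * sigmoid_prime(z) for a, t, z in zip(a_out, y, z_out)]

def hidden_delta(W_next, delta_next, z_hidden):
    """Error signal at a hidden layer: (W_next^T delta_next) * sigma'(z)."""
    # Entry j of W_next^T delta_next sums the blame sent back to hidden unit j.
    back = [sum(W_next[i][j] * delta_next[i] for i in range(len(delta_next)))
            for j in range(len(z_hidden))]
    return [b * sigmoid_prime(z) for b, z in zip(back, z_hidden)]

# Hypothetical values: 2 hidden units feeding 1 output unit.
z_hidden = [0.0, 0.0]
W2 = [[2.0, -4.0]]
z_out = [0.0]
a_out = [sigmoid(z_out[0])]  # 0.5
y = [1.0]

delta_out = output_delta(a_out, y, z_out)
delta_hid = hidden_delta(W2, delta_out, z_hidden)

print(delta_out, delta_hid)
```

Notice that the hidden unit with the larger outgoing weight (in absolute value) receives the larger share of the blame. That is the replay analysis at work.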
Step 4: Gradients — Personalized Coaching
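Using the error signal $\delta^{(l)}$, the gradients for each weight matrix and bias are:
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}$$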
These gradients are the precise instructions: “this weight contributed this much to the error.”
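One way to convince yourself the gradient formulas are right is to compare them against a brute-force numerical estimate. The sketch below does that for a single sigmoid layer with half squared-error loss. The weights, inputs, and targets are hypothetical, and the finite-difference check is a common debugging habit rather than part of backpropagation itself.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(W, b, a_prev, y):
    """Half squared error of a single sigmoid layer (toy setup)."""
    total = 0.0
    for row, b_i, t in zip(W, b, y):
        z = sum(w * a for w, a in zip(row, a_prev)) + b_i
        total += 0.5 * (sigmoid(z) - t) ** 2
    return total

def gradients(W, b, a_prev, y):
    """Return (dL/dW, dL/db) using delta = (a - y) * sigma'(z)."""
    delta = []
    for row, b_i, t in zip(W, b, y):
        z = sum(w * a for w, a in zip(row, a_prev)) + b_i
        a = sigmoid(z)
        delta.append((a - t) * a * (1.0 - a))
    grad_W = [[d * a for a in a_prev] for d in delta]  # outer product: delta a_prev^T
    grad_b = list(delta)                               # bias gradient is delta itself
    return grad_W, grad_b

W = [[0.1, -0.2], [0.4, 0.3]]
b = [0.0, 0.1]
a_prev = [1.0, 2.0]
y = [1.0, 0.0]

grad_W, grad_b = gradients(W, b, a_prev, y)

# Central finite difference on one weight, W[0][1], as a sanity check.
eps = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[0][1] += eps
W_minus[0][1] -= eps
numeric = (loss(W_plus, b, a_prev, y) - loss(W_minus, b, a_prev, y)) / (2 * eps)
analytic = grad_W[0][1]

print(analytic, numeric)
```

The two numbers should agree to many decimal places. When they do not, the bug is almost always in the hand-written backward pass.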
Step 5: Weight Update — Gradient Descent
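You take the gradients and nudge every weight in the direction that reduces the loss. This is gradient descent:
$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial b^{(l)}}$$
Where $\eta$ (eta) is the learning rate, i.e. how big a step you take. Too large: you overshoot the minimum and can even diverge. Too small: training crawls.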
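The effect of the learning rate is easiest to see on a loss with one parameter. This sketch runs gradient descent on the made-up loss `w**2` (minimum at `w = 0`) with three hand-picked learning rates. The function name `descend` and the specific rates are illustrative assumptions.

```python
def descend(learning_rate, steps=50, w=1.0):
    """Run gradient descent on the toy loss w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

too_large = descend(1.1)    # each step overshoots and lands farther from 0
too_small = descend(0.001)  # each step barely moves
reasonable = descend(0.1)   # steady progress toward 0

print(too_large, too_small, reasonable)
```

On this loss each update multiplies `w` by `1 - 2 * learning_rate`, so a rate of 1.1 grows the error by 20% per step, while 0.001 shrinks it by only 0.2% per step.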
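The full algorithm, compactly:
$$\begin{aligned}
\text{Forward:} \quad & z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \quad a^{(l)} = \sigma\left(z^{(l)}\right) && l = 1, \dots, L \\
\text{Output error:} \quad & \delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'\left(z^{(L)}\right) \\
\text{Backward:} \quad & \delta^{(l)} = \left(\left(W^{(l+1)}\right)^{\top} \delta^{(l+1)}\right) \odot \sigma'\left(z^{(l)}\right) && l = L-1, \dots, 1 \\
\text{Update:} \quad & W^{(l)} \leftarrow W^{(l)} - \eta \, \delta^{(l)} \left(a^{(l-1)}\right)^{\top}, \quad b^{(l)} \leftarrow b^{(l)} - \eta \, \delta^{(l)} && l = 1, \dots, L
\end{aligned}$$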
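Putting the pieces together, here is a minimal end-to-end sketch in plain Python: a network with 2 inputs, 2 hidden units, and 1 output, learning the logical OR function. Everything specific is an assumption made for illustration: sigmoid activations in both layers, half squared-error loss, fixed hypothetical starting weights, a learning rate of 0.5, and an update after every example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for W, b in params:
        a_prev = activations[-1]
        a = [sigmoid(sum(w * v for w, v in zip(row, a_prev)) + b_i)
             for row, b_i in zip(W, b)]
        activations.append(a)
    return activations

def backward(params, activations, y):
    """Return one error signal per layer, computed from the output layer back."""
    a_out = activations[-1]
    # Output layer: (a - y) * sigma'(z), where sigma'(z) = a * (1 - a) for sigmoid.
    deltas = [[(a - t) * a * (1.0 - a) for a, t in zip(a_out, y)]]
    for l in range(len(params) - 1, 0, -1):
        W_next = params[l][0]       # weights of the layer after the hidden one
        a_hidden = activations[l]   # activations of the hidden layer itself
        delta_next = deltas[0]
        delta = [sum(W_next[i][j] * delta_next[i] for i in range(len(delta_next)))
                 * a_hidden[j] * (1.0 - a_hidden[j])
                 for j in range(len(a_hidden))]
        deltas.insert(0, delta)
    return deltas

def update(params, activations, deltas, lr):
    """Gradient descent step: W -= lr * delta a_prev^T, b -= lr * delta."""
    for (W, b), a_prev, delta in zip(params, activations, deltas):
        for i, d in enumerate(delta):
            for j, a in enumerate(a_prev):
                W[i][j] -= lr * d * a
            b[i] -= lr * d

def total_loss(params, data):
    return sum(0.5 * (forward(params, x)[-1][0] - y[0]) ** 2 for x, y in data)

# Logical OR as training data.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]

# Hypothetical, deliberately asymmetric starting weights: (W, b) per layer.
params = [([[0.5, -0.4], [-0.3, 0.2]], [0.0, 0.0]),
          ([[0.3, -0.2]], [0.0])]

initial_loss = total_loss(params, data)
for epoch in range(2000):
    for x, y in data:
        activations = forward(params, x)                 # run
        deltas = backward(params, activations, y)        # measure and blame
        update(params, activations, deltas, lr=0.5)      # improve
final_loss = total_loss(params, data)

print(initial_loss, final_loss)
```

Real frameworks compute the backward pass automatically and use matrix libraries instead of nested loops, but the loop they run has the same four beats.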
Step 6: Iteration
You repeat this loop — forward pass, compute loss, backward pass, update — across thousands or millions of training examples.
In stochastic gradient descent (SGD), you update after each single example. In mini-batch gradient descent, you average the gradients over a small batch before updating — a balance between speed and stability.
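The difference between the two schedules comes down to when the update happens. A minimal sketch, again on the one-weight model `y = w * x` with made-up data (true weight 2) and a hand-picked learning rate and batch size:

```python
def gradient(w, x, y):
    """Gradient of 0.5 * (w * x - y)**2 with respect to w."""
    return (w * x - y) * x

def sgd_epoch(w, data, lr):
    """Stochastic gradient descent: one update per example."""
    for x, y in data:
        w -= lr * gradient(w, x, y)
    return w

def minibatch_epoch(w, data, lr, batch_size):
    """Mini-batch gradient descent: average the gradients, then update once per batch."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        avg_grad = sum(gradient(w, x, y) for x, y in batch) / len(batch)
        w -= lr * avg_grad
    return w

data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # invented data, true w = 2

w_sgd = 0.0
w_mb = 0.0
for _ in range(100):
    w_sgd = sgd_epoch(w_sgd, data, lr=0.1)
    w_mb = minibatch_epoch(w_mb, data, lr=0.1, batch_size=2)

print(w_sgd, w_mb)
```

On this noise-free toy problem both schedules reach the same answer. In practice the examples are usually shuffled each epoch, and averaging over a batch smooths out the noise that individual examples introduce.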
Picture the loss surface as a hilly landscape, with gradient descent walking downhill. Each step is small. Each step is guided. With a sensible learning rate, the network usually settles into a low-loss valley, though not necessarily the lowest one.
The Complete Picture
| Step | Relay Metaphor | Math |
|---|---|---|
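| Forward pass | Runners complete the race | $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, $a^{(l)} = \sigma(z^{(l)})$ per layer |
| Compute error | Check finish time | $\mathcal{L} = \frac{1}{2} \sum_i (\hat{y}_i - y_i)^2$ |
| Backward pass | Replay analysis | Chain rule: $\delta^{(l)} = ((W^{(l+1)})^{\top} \delta^{(l+1)}) \odot \sigma'(z^{(l)})$ |
| Gradients | Personalized coaching | $\partial \mathcal{L} / \partial W^{(l)} = \delta^{(l)} (a^{(l-1)})^{\top}$ |
| Update | Adjust technique | $W^{(l)} \leftarrow W^{(l)} - \eta \, \partial \mathcal{L} / \partial W^{(l)}$ |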
| Iterate | Race again | Repeat until convergence |
Final Thought
Every architecture you’ve heard of (CNNs, RNNs, LSTMs, Transformers) is a relay team trained with backpropagation. GPT models are trained with this same loop, scaled up with more layers, more data, and more compute.
Backpropagation may sound intimidating, but at its core it’s just one loop: run, measure, blame, improve. Repeated relentlessly until the network gets it right.
But here’s the twist: your brain probably doesn’t work this way.
Most neuroscientists doubt that the brain runs backpropagation as described here. Biological learning appears to rely on other mechanisms that are still poorly understood: local signals, messy feedback, temporal dynamics, perhaps something we haven’t discovered yet. Backpropagation is a brilliant engineering solution. As far as we know, it is not how nature solved the same problem.
So if you’re feeling inspired: the real race is figuring out how nature pulled it off.
References
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
- Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.