
Backpropagation: The Relay Race That Taught Machines to Learn

तमसो मा ज्योतिर्गमय ("Lead me from darkness to light"): from avidyā (ignorance) to jñāna (knowledge).

Let’s face it: deep learning sounds like wizardry. Massive neural networks, trained on mountains of data, doing things that look suspiciously like magic. But behind all the hype and jargon lies an algorithm so elegant — and so grounded in common sense — it’s almost anticlimactic.

That algorithm is backpropagation.

Today, we’ll unpack what backprop really is, where it came from, and how it works, in plain English with enough math to keep things honest. The metaphor we’ll use: a relay race.


If You Have No Idea What Deep Learning Is

Imagine you have a pile of labeled examples:

  • Input: a photo of a cat wearing sunglasses → Output: Yes
  • Input: a photo of a dog → Output: No
  • Input: handwritten squiggles → Output: the digit 5

Deep learning builds a system — a model — that learns the mapping from input to output automatically, without hand-coded rules. Show it enough examples, and it figures out the rules itself. That’s why it’s powerful.

But none of this works without a way to learn from mistakes. To see that it guessed wrong, and adjust itself to be more right next time.

That’s exactly what backpropagation is.


What Is Backpropagation?

Backpropagation (short for backward propagation of errors) is the algorithm that makes neural networks learn. It figures out which part of the decision-making process messed up — and by how much.

Imagine a relay team that keeps finishing last. After each race, the coach doesn’t just yell “run faster.” Instead, he reviews the replay and tells each runner exactly what they could do better: start quicker, pass the baton smoother, sprint harder at the finish.

Backpropagation is that replay analysis and targeted coaching.


A Brief History (The Ancient Scroll)

The idea of training neural networks this way is usually traced to Paul Werbos’s 1974 Harvard dissertation. For years, it sat largely unnoticed, like inventing rock and roll before anyone had built a guitar.

It was rediscovered and popularized by Rumelhart, Hinton, and Williams in 1986:

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.

Once rediscovered, it transformed neural networks from academic curiosities into tools that could learn complex patterns from data.


What Problem Did It Solve?

Before backprop, training multilayer networks was like improving a relay team by only coaching the last runner.

You could tweak the output layer. But there was no systematic way to improve the hidden layers — the intermediate runners who do most of the real work — so they often remained random and unhelpful.

Backpropagation solved this by showing exactly how errors flowed backward through every connection in the network. Now every runner could learn together.


How It Works: The Non-Math Way

Here’s the big picture:

  1. Forward Pass: Your runners complete the race. The network makes a prediction.
  2. Compute Error: You look at the finish time. How far off were you?
  3. Backward Pass: You watch the replay and figure out who contributed most to the slowdown.
  4. Gradients: You give each runner personalized advice, with adjustments proportional to their contribution.
  5. Update: Each runner tweaks their approach slightly for the next race.
  6. Iterate: Repeat thousands of times. The team steadily improves.
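
The six steps above can be sketched for the smallest possible team: a single runner with one weight `w`, predicting `y_hat = w * x`. The training example, starting weight, and learning rate below are illustrative assumptions, not values from this article.

```python
# Toy setup (all values are illustrative assumptions).
x, y = 2.0, 6.0      # one training example; the ideal weight is 3.0
w = 0.0              # starting weight
lr = 0.1             # learning rate

losses = []
for step in range(50):                 # 6. iterate
    y_hat = w * x                      # 1. forward pass
    loss = 0.5 * (y_hat - y) ** 2      # 2. compute error
    dloss_dyhat = y_hat - y            # 3. backward pass: d(loss)/d(y_hat)
    grad_w = dloss_dyhat * x           # 4. gradient via the chain rule, since d(y_hat)/dw = x
    w -= lr * grad_w                   # 5. update
    losses.append(loss)

print(f"learned w = {w:.6f}, final loss = {losses[-1]:.2e}")
```

With these particular numbers, each update shrinks the distance between `w` and 3.0 by a constant factor, so the loop closes in on the ideal weight quickly. Real networks have millions of weights, but every one of them goes through this same loop.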

How It Works: The Math

Now let’s make the metaphor precise.

Step 1: The Forward Pass (The Network)

Consider a three-layer network: input layer (green / Runner 1), hidden layer (pink / Runner 2), output layer (blue / Runner 3).

Each layer after the input performs two operations, a linear transformation followed by a nonlinear activation, with the input itself serving as $a^{(0)} = x$:

$$z^{(l)} = W^{(l)} \, a^{(l-1)} + b^{(l)}$$

$$a^{(l)} = \sigma\!\left(z^{(l)}\right)$$

Where:

  • $W^{(l)}$: weight matrix for layer $l$ (how hard each runner pushes)
  • $b^{(l)}$: bias vector (starting advantage or disadvantage)
  • $\sigma$: activation function, e.g. ReLU: $\sigma(z) = \max(0, z)$, or sigmoid: $\sigma(z) = \dfrac{1}{1+e^{-z}}$
  • $a^{(l)}$: activations (the baton being passed forward)
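
As a rough sketch, the two activation functions named above can be written in a few lines of NumPy. The derivatives are included because the backward pass needs $\sigma'$. The function names are my own, and treating the ReLU derivative as 0 at exactly $z = 0$ is a common convention rather than something this article specifies.

```python
import numpy as np

def relu(z):
    # max(0, z), applied element-wise
    return np.maximum(0.0, z)

def relu_prime(z):
    # 1 where z > 0, else 0 (using 0 at z == 0 by convention)
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 3.0])
print("relu:   ", relu(z))
print("sigmoid:", sigmoid(z))
```

The sigmoid squashes any input into the range (0, 1), and its slope peaks at 0.25 when `z` is 0. ReLU passes positive values through unchanged and zeroes out the rest.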

For a full forward pass through all layers:

$$\hat{y} = a^{(L)} = \sigma\!\left(W^{(L)} \cdots \sigma\!\left(W^{(2)} \sigma\!\left(W^{(1)} x + b^{(1)}\right) + b^{(2)}\right) \cdots + b^{(L)}\right)$$
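
The nested formula is easier to read as a loop. Here is a minimal NumPy sketch of the forward pass, assuming sigmoid activations at every layer and a made-up 3-4-1 network with random weights. The `forward` function and the layer sizes are illustrative choices, not something taken from a particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the pre-activations z and activations a for every layer."""
    a = x
    zs, activations = [], [a]      # activations[0] is the input, a^(0) = x
    for W, b in zip(weights, biases):
        z = W @ a + b              # linear transformation
        a = sigmoid(z)             # nonlinear activation
        zs.append(z)
        activations.append(a)
    return zs, activations

rng = np.random.default_rng(0)
sizes = [3, 4, 1]                  # hypothetical widths: input, hidden, output
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = np.array([0.5, -1.0, 2.0])
zs, activations = forward(x, weights, biases)
y_hat = activations[-1]
print("prediction:", y_hat)
```

The function keeps every intermediate `z` and `a` instead of returning only the prediction. That is deliberate: the backward pass reuses them, much like the coach needs footage of every leg of the race, not just the finish.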

Step 2: Compute the Error (Loss)

Once you have the prediction $\hat{y}$, you measure how wrong it is. For regression, the standard choice is the squared error (averaged over a dataset, this is the Mean Squared Error):

$$\mathcal{L} = \frac{1}{2}\left(\hat{y} - y\right)^2$$

The $\frac{1}{2}$ is purely cosmetic: it cancels cleanly when you differentiate. For classification, you’d use cross-entropy loss:

$$\mathcal{L} = -\sum_{c} y_c \log \hat{y}_c$$
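
Both losses fit in a few lines. This sketch makes two assumptions of its own: the squared error is summed over output components so that it also works for vector outputs, and a small `eps` guards the logarithm against `log(0)`. The example numbers are made up.

```python
import numpy as np

def squared_error(y_hat, y):
    return 0.5 * np.sum((y_hat - y) ** 2)

def squared_error_grad(y_hat, y):
    # d/d(y_hat) of 0.5 * (y_hat - y)^2: the 1/2 cancels the 2 from the power rule
    return y_hat - y

def cross_entropy(y_hat, y, eps=1e-12):
    # y is a one-hot label, y_hat a vector of predicted probabilities
    return -np.sum(y * np.log(y_hat + eps))

se = squared_error(np.array([0.8]), np.array([1.0]))           # about 0.02
ce = cross_entropy(np.array([0.2, 0.7, 0.1]),
                   np.array([0.0, 1.0, 0.0]))                  # about -log(0.7)
print(f"squared error: {se:.4f}, cross-entropy: {ce:.4f}")
```

Because the label is one-hot, the cross-entropy only looks at the probability assigned to the correct class. The more confident the network is in the right answer, the closer the loss gets to zero.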

Step 3: The Backward Pass — The Chain Rule

Here’s the core insight. The loss $\mathcal{L}$ depends on $\hat{y}$, which depends on $z^{(L)}$, which depends on $W^{(L)}$ and $a^{(L-1)}$, which depends on the layer before it… and so on, all the way back to the input.

To know how much weight $W^{(l)}$ contributed to the error, we apply the chain rule of calculus:

$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial W^{(l)}}$$

We define the error signal $\delta^{(l)}$ at each layer, a measure of how much that layer contributed to the final loss:

$$\delta^{(l)} = \frac{\partial \mathcal{L}}{\partial z^{(l)}}$$

For the output layer:

$$\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \sigma'\!\left(z^{(L)}\right)$$

For a hidden layer, error propagates backward through the weights of the next layer:

$$\delta^{(l)} = \left(W^{(l+1)}\right)^{\top} \delta^{(l+1)} \odot \sigma'\!\left(z^{(l)}\right)$$

Where $\odot$ is element-wise multiplication and $\sigma'$ is the derivative of the activation function.

This is the relay race in reverse — blame passing backward from the last runner to the first.
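
Here is one way the two error-signal formulas could look in NumPy. It is a sketch under stated assumptions: sigmoid activations everywhere, the squared-error loss from Step 2, and a hypothetical 3-4-1 network with random weights. The `forward` and `backward_deltas` names are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, weights, biases):
    a, zs, activations = x, [], [x]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def backward_deltas(y, zs, activations, weights):
    """Return delta^(l) for every layer, for the loss 0.5 * sum((y_hat - y)^2)."""
    deltas = [None] * len(weights)
    # Output layer: dL/dy_hat times sigma'(z^(L))
    deltas[-1] = (activations[-1] - y) * sigmoid_prime(zs[-1])
    # Hidden layers, last to first: (W^(l+1))^T delta^(l+1), element-wise times sigma'(z^(l))
    for l in range(len(weights) - 2, -1, -1):
        deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * sigmoid_prime(zs[l])
    return deltas

rng = np.random.default_rng(0)
sizes = [3, 4, 1]                  # hypothetical widths: input, hidden, output
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])

zs, activations = forward(x, weights, biases)
deltas = backward_deltas(y, zs, activations, weights)
print([d.shape for d in deltas])   # one error signal per neuron in each layer
```

The loop runs from the last layer to the first, and each step reuses the delta it just computed. That reuse is why one backward pass costs roughly as much as one forward pass, instead of requiring a separate calculation for every weight.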


Step 4: Gradients — Personalized Coaching

Using the error signal $\delta^{(l)}$, the gradients for each weight matrix and bias are:

$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^{\top}$$

$$\frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}$$

These gradients are the precise instructions: “this weight contributed this much to the error.”
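
A good way to trust these formulas is to check them against a brute-force estimate: nudge one weight slightly, rerun the forward pass, and see how much the loss moves. The sketch below does that for a hypothetical two-layer sigmoid network with squared-error loss. The sizes, seed, and the weight entry being checked are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer (hypothetical sizes)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])

def loss_fn(W1, b1, W2, b2):
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

# Forward pass, keeping the intermediate activations.
a1 = sigmoid(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)

# Error signals, using sigmoid'(z) = a * (1 - a).
delta2 = (a2 - y) * a2 * (1 - a2)
delta1 = (W2.T @ delta2) * a1 * (1 - a1)

# Gradients: outer product of each delta with the activations feeding that layer.
grad_W2, grad_b2 = np.outer(delta2, a1), delta2
grad_W1, grad_b1 = np.outer(delta1, x), delta1

# Finite-difference check on a single entry of W1.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[2, 1] += eps
W1_minus[2, 1] -= eps
numeric = (loss_fn(W1_plus, b1, W2, b2) - loss_fn(W1_minus, b1, W2, b2)) / (2 * eps)
print("difference:", abs(numeric - grad_W1[2, 1]))   # should be tiny
```

The finite-difference route needs two forward passes per weight, which is hopeless for a large network. Backpropagation gets all the gradients from one forward and one backward pass, which is the whole point.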


Step 5: Weight Update — Gradient Descent

You take the gradients and nudge every weight in the direction that reduces the loss. This is gradient descent:

$$W^{(l)} \leftarrow W^{(l)} - \eta \, \frac{\partial \mathcal{L}}{\partial W^{(l)}}$$

$$b^{(l)} \leftarrow b^{(l)} - \eta \, \frac{\partial \mathcal{L}}{\partial b^{(l)}}$$

Where $\eta$ (eta) is the learning rate, which sets how big a step you take. Too large: you overshoot. Too small: you never arrive.

The update rule, compactly, with $\theta$ standing for all the weights and biases together:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)$$
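
To see what "too large" and "too small" mean in practice, here is a toy sketch that applies the update rule to a one-parameter loss, $\mathcal{L}(\theta) = \frac{1}{2}(\theta - 3)^2$, whose minimum sits at $\theta = 3$. The loss function and the three learning rates are assumptions picked for illustration.

```python
def descend(lr, steps=20, theta=0.0):
    """Gradient descent on L(theta) = 0.5 * (theta - 3)^2, whose gradient is theta - 3."""
    for _ in range(steps):
        grad = theta - 3.0
        theta = theta - lr * grad
    return theta

good = descend(lr=0.5)     # closes in on 3
tiny = descend(lr=0.001)   # barely moves in 20 steps
huge = descend(lr=2.5)     # overshoots further on every step and diverges

print(f"lr=0.5: {good:.4f}   lr=0.001: {tiny:.4f}   lr=2.5: {huge:.1f}")
```

On this particular loss, every step multiplies the distance to the minimum by `1 - lr`. That explains all three outcomes: 0.5 halves the distance each time, 0.001 hardly dents it, and 2.5 flips the sign and grows it by half.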

Step 6: Iteration

You repeat this loop — forward pass, compute loss, backward pass, update — across thousands or millions of training examples.

In stochastic gradient descent (SGD), you update after each single example. In mini-batch gradient descent, you average the gradients over a small batch before updating — a balance between speed and stability.

Picture the loss as a hilly landscape, with gradient descent walking downhill. Each step is small. Each step is guided. With a sensible learning rate, the loss usually settles into a low valley, though not necessarily the lowest one.
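
Putting the whole loop together, here is a minimal mini-batch training sketch in NumPy. Everything concrete in it is an assumption made for illustration: the toy dataset (label 1 when the two inputs sum to more than zero), the 2-4-1 sigmoid network, the squared-error loss, and the learning rate, batch size, and epoch count.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy data (illustrative assumption): label is 1 when x0 + x1 > 0.
X = rng.normal(size=(200, 2))
Y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer of 4 units; all hyperparameters are arbitrary choices.
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
lr, batch_size, epochs = 0.5, 20, 200

def mean_loss():
    A2 = sigmoid(sigmoid(X @ W1.T + b1) @ W2.T + b2)
    return 0.5 * np.mean((A2 - Y) ** 2)

losses = [mean_loss()]
for epoch in range(epochs):
    order = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, Yb = X[idx], Y[idx]

        # Forward pass (one example per row).
        A1 = sigmoid(Xb @ W1.T + b1)
        A2 = sigmoid(A1 @ W2.T + b2)

        # Backward pass: error signals for each example in the batch.
        D2 = (A2 - Yb) * A2 * (1 - A2)
        D1 = (D2 @ W2) * A1 * (1 - A1)

        # Gradients averaged over the batch, then the update.
        n = len(idx)
        gW2, gb2 = D2.T @ A1 / n, D2.mean(axis=0)
        gW1, gb1 = D1.T @ Xb / n, D1.mean(axis=0)
        W2 -= lr * gW2
        b2 -= lr * gb2
        W1 -= lr * gW1
        b1 -= lr * gb1
    losses.append(mean_loss())

print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

Setting `batch_size = 1` turns this into the one-example-at-a-time SGD described above, and setting it to `len(X)` gives full-batch gradient descent. The batch size is the dial between noisy, frequent updates and smooth, infrequent ones.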


The Complete Picture

| Step | Relay Metaphor | Math |
|---|---|---|
| Forward pass | Runners complete the race | $\hat{y} = \sigma(Wx + b)$ per layer |
| Compute error | Check finish time | $\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2$ |
| Backward pass | Replay analysis | Chain rule: $\delta^{(l)} = (W^{(l+1)})^\top \delta^{(l+1)} \odot \sigma'(z^{(l)})$ |
| Gradients | Personalized coaching | $\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^\top$ |
| Update | Adjust technique | $W \leftarrow W - \eta \nabla_W \mathcal{L}$ |
| Iterate | Race again | Repeat until convergence |

Final Thought

Every architecture you’ve heard of — CNNs, RNNs, LSTMs, Transformers — is a relay team trained with backpropagation. GPT itself runs this same loop, scaled up with more layers, more data, and more compute.

Backpropagation may sound intimidating, but at its core it’s just one loop: run, measure, blame, improve. Repeated relentlessly until the network gets it right.

But here’s the twist: your brain doesn’t actually work this way.

Neuroscientists believe biological learning uses different, more mysterious mechanisms — local signals, messy feedback, temporal dynamics, perhaps something we haven’t discovered yet. Backpropagation is a brilliant engineering solution. It is not how nature solved the same problem.

So if you’re feeling inspired: the real race is figuring out how nature pulled it off.


References

  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
  • Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.


For the Lazy

The summary you actually wanted.

The Engine

Deep learning sounds like wizardry. But behind the hype lies an algorithm so grounded in common sense, it's almost anticlimactic. Backpropagation. That's it. That's the whole trick.

Backpropagation (/ˌbakprɒpəˈɡeɪʃ(ə)n/): the method for telling every part of a neural network exactly how much it contributed to the error, so every part can improve. Not just the last layer. Every layer.

"A relay team that keeps finishing last. The coach doesn't yell 'run faster.' He reviews the replay and gives every runner personalized feedback." That replay analysis is backpropagation.

  • 1974, First Proposed: Paul Werbos's Harvard dissertation. Sat unnoticed for a decade. Like inventing rock and roll before anyone had built a guitar.
  • Before Backprop, Stuck: You could only train the output layer. Hidden layers stayed random. Multilayer networks were a dead end.
  • After Backprop, Deep: Error flows backward through every connection. Every neuron learns together. Deep networks became practical.

The 6-step learning loop

  1. Forward Pass: Run the race. Each layer transforms the input toward a prediction.
  2. Compute Error: Measure the gap between prediction and truth.
  3. Backward Pass: Trace blame through the network via the chain rule.
  4. Gradients: Compute personalized correction instructions for every weight.
  5. Update: Nudge each weight in the direction that reduces error.
  6. Iterate: Repeat across millions of examples until convergence.

The Twist

Your brain doesn't work this way. Neuroscientists believe biological learning uses local signals, messy feedback, and mechanisms we haven't fully discovered. Backpropagation is the warm-up lap.

Run. Measure. Blame. Improve. Repeat.

The real race is figuring out how nature pulled off intelligence without any of this.