6 Backpropagation
This is the heart of deep learning. Let’s demystify it.
6.1 The Chain Rule
Backpropagation is just the chain rule from calculus, applied systematically.
If \(z = f(y)\) and \(y = g(x)\), then:
\[\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}\]
That’s it. No magic.
6.2 A Simple Example
Consider:
```python
x = 2.0
y = x ** 2   # y = 4
z = y * 3    # z = 12
```

We want \(\frac{dz}{dx}\).
By chain rule:
\[\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = 3 \cdot 2x = 3 \cdot 4 = 12\]
Backprop computes this backwards:
- Start: \(\frac{dz}{dz} = 1\)
- Through `* 3`: \(\frac{dz}{dy} = 3\)
- Through `** 2`: \(\frac{dz}{dx} = 3 \cdot 2x = 12\)
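Here is that backward trace as a few lines of plain Python, with no framework involved, just the chain rule applied by hand:

```python
x = 2.0
y = x ** 2             # forward: y = 4
z = y * 3              # forward: z = 12

dz_dz = 1.0            # start at the output
dz_dy = dz_dz * 3      # through * 3: local gradient is 3
dz_dx = dz_dy * 2 * x  # through ** 2: local gradient is 2x
print(dz_dx)           # 12.0
```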
6.3 Gradients for Each Operation
Each operation needs a backward function:
6.3.1 Addition: \(z = a + b\)
\[\frac{\partial z}{\partial a} = 1, \quad \frac{\partial z}{\partial b} = 1\]
Gradient just passes through unchanged.
6.3.2 Multiplication: \(z = a \times b\)
\[\frac{\partial z}{\partial a} = b, \quad \frac{\partial z}{\partial b} = a\]
Gradient is scaled by the other input.
6.3.3 Matrix Multiplication: \(Z = A @ B\)
\[\frac{\partial L}{\partial A} = \frac{\partial L}{\partial Z} @ B^T\] \[\frac{\partial L}{\partial B} = A^T @ \frac{\partial L}{\partial Z}\]
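The transposes are easy to get backwards, so a quick NumPy sanity check helps. This sketch (shapes and variable names are illustrative) takes \(L = \text{sum}(A @ B)\), so \(\frac{\partial L}{\partial Z}\) is a matrix of ones, and compares one entry of the analytic gradient against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

grad_Z = np.ones((2, 4))      # dL/dZ for L = (A @ B).sum()
grad_A = grad_Z @ B.T         # shape (2, 3), matches A
grad_B = A.T @ grad_Z         # shape (3, 4), matches B

# Finite-difference check on A[0, 0]
eps = 1e-6
A_plus = A.copy()
A_plus[0, 0] += eps
numeric = ((A_plus @ B).sum() - (A @ B).sum()) / eps
print(np.isclose(grad_A[0, 0], numeric, atol=1e-4))   # True
```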
6.3.4 Power: \(z = x^n\)
\[\frac{\partial z}{\partial x} = n \cdot x^{n-1}\]
6.3.5 Mean: \(z = \text{mean}(x)\)
\[\frac{\partial z}{\partial x_i} = \frac{1}{n}\]
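Same idea for the mean: every element receives the same \(1/n\), which a short NumPy check (illustrative only) confirms:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
analytic = np.full_like(x, 1.0 / x.size)   # dz/dx_i = 1/n = 0.25

# Finite-difference check on x[2]
eps = 1e-6
x_plus = x.copy()
x_plus[2] += eps
numeric = (x_plus.mean() - x.mean()) / eps
print(np.isclose(analytic[2], numeric))    # True
```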
6.4 Implementing Backward
Let’s add a backward() method to our Tensor class:
```python
class Tensor:
    def backward(self):
        """Compute gradients via backpropagation."""
        # Step 1: Topological sort (process nodes in correct order)
        order = self._topological_sort()

        # Step 2: Initialize gradient of output
        self.grad = np.ones_like(self.data)

        # Step 3: Propagate gradients backwards
        for tensor in order:
            if tensor.grad_fn is None:
                continue  # Leaf tensor

            # Get gradient flowing into this tensor
            grad_output = tensor.grad

            # Compute gradients for each parent
            grads = self._compute_gradients(tensor, grad_output)

            # Accumulate gradients in parents
            for parent, grad in zip(tensor.parents, grads):
                if parent.requires_grad:
                    if parent.grad is None:
                        parent.grad = np.zeros_like(parent.data)
                    parent.grad += grad
```

Code Reference: See src/tensorweaver/autodiff/tensor.py for the full backward implementation.
6.5 Topological Sort
We must process nodes in the right order — from output to input:
```python
def _topological_sort(self):
    """Return tensors in reverse order for backprop."""
    visited = set()
    order = []

    def dfs(tensor):
        if id(tensor) in visited:
            return
        visited.add(id(tensor))
        for parent in tensor.parents:
            dfs(parent)
        order.append(tensor)

    dfs(self)
    return order[::-1]  # Reverse for backprop
```

Code Reference: See src/tensorweaver/autodiff/topological_sort.py for the implementation.
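To see the resulting order concretely, here is the same depth-first traversal on a toy graph built from plain dictionaries (the node names are made up, loosely following the temperature example):

```python
# Parents of each node in a tiny graph: loss <- squared <- diff <- pred <- (w, celsius)
parents = {
    "loss": ["squared"],
    "squared": ["diff"],
    "diff": ["pred"],
    "pred": ["w", "celsius"],
    "w": [],
    "celsius": [],
}

visited, order = set(), []

def dfs(node):
    if node in visited:
        return
    visited.add(node)
    for parent in parents[node]:
        dfs(parent)
    order.append(node)

dfs("loss")
print(order[::-1])   # ['loss', 'squared', 'diff', 'pred', 'celsius', 'w']
```

The output comes first and the leaves come last, which is exactly the order backward() needs.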
6.6 Computing Gradients by Operation
```python
def _compute_gradients(self, tensor, grad_output):
    """Compute gradients based on the operation."""
    if tensor.grad_fn == 'add':
        # d(a+b)/da = 1, d(a+b)/db = 1
        return [grad_output, grad_output]

    elif tensor.grad_fn == 'matmul':
        a, b = tensor.parents
        # d(a@b)/da = grad @ b.T
        # d(a@b)/db = a.T @ grad
        grad_a = grad_output @ b.data.T
        grad_b = a.data.T @ grad_output
        return [grad_a, grad_b]

    elif tensor.grad_fn == 'pow':
        parent = tensor.parents[0]
        power = tensor._power  # Stored during forward
        # d(x^n)/dx = n * x^(n-1)
        grad = power * (parent.data ** (power - 1)) * grad_output
        return [grad]

    elif tensor.grad_fn == 'mean':
        parent = tensor.parents[0]
        # d(mean(x))/dx = 1/n
        grad = np.ones_like(parent.data) * grad_output / parent.data.size
        return [grad]

    # ... more operations ...
```

6.7 Temperature Example: Full Backward Pass
Let’s trace through:
```python
# Setup
celsius = Tensor([[0.0], [100.0]])
targets = Tensor([[32.0], [212.0]])
w = Tensor([[0.5]], requires_grad=True)
b = Tensor([10.0], requires_grad=True)

# Forward
pred = celsius @ w.T + b   # [[10], [60]]
diff = pred - targets      # [[-22], [-152]]
squared = diff ** 2        # [[484], [23104]]
loss = squared.mean()      # 11794.0

# Backward
loss.backward()
```

Step by step:
1. `loss.grad = 1.0`
2. Through `mean()`: `squared.grad = 1.0 / 2 = 0.5` (for each element)
3. Through `** 2`: `diff.grad = 0.5 * 2 * diff = [[-22], [-152]]`
4. Through subtraction: `pred.grad = [[-22], [-152]]`
5. Through addition: `matmul_out.grad = [[-22], [-152]]` and `b.grad = sum([[-22], [-152]]) = -174`
6. Through matmul: `w.grad = celsius.T @ matmul_out.grad = [[0, 100]] @ [[-22], [-152]] = [[-15200]]`
Final gradients:
- `w.grad = -15200` (w should increase)
- `b.grad = -174` (b should increase)
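The same numbers can be reproduced by hand with plain NumPy (no Tensor class involved), which is a useful cross-check of the trace above:

```python
import numpy as np

celsius = np.array([[0.0], [100.0]])
targets = np.array([[32.0], [212.0]])
w = np.array([[0.5]])
b = np.array([10.0])

# Forward
pred = celsius @ w.T + b              # [[10.], [60.]]
diff = pred - targets                 # [[-22.], [-152.]]
loss = (diff ** 2).mean()             # 11794.0

# Backward, applying each local gradient by hand
grad_squared = np.full_like(diff, 1.0 / diff.size)   # 0.5 per element
grad_diff = grad_squared * 2 * diff                  # [[-22.], [-152.]]
grad_pred = grad_diff                                # subtraction passes it through
grad_b = grad_pred.sum()                             # -174.0
grad_w = celsius.T @ grad_pred                       # [[-15200.]]

print(loss, grad_b, grad_w)
```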
6.8 Gradient Checking
Verify gradients numerically:
```python
def gradient_check(param, loss_fn, eps=1e-5):
    """Check gradient using finite differences."""
    param.grad = None
    loss = loss_fn()
    loss.backward()
    analytic = param.grad.copy()

    numeric = np.zeros_like(param.data)
    for idx in np.ndindex(param.data.shape):
        param.data[idx] += eps
        loss_plus = loss_fn().data
        param.data[idx] -= 2 * eps
        loss_minus = loss_fn().data
        param.data[idx] += eps  # Restore

        numeric[idx] = (loss_plus - loss_minus) / (2 * eps)

    diff = np.abs(analytic - numeric).max()
    print(f"Max difference: {diff}")
    return diff < 1e-4
```

6.9 Why Accumulate Gradients?
We use += instead of =:
```python
parent.grad += grad
```

Because a tensor might be used multiple times:

```python
y = x + x   # x used twice!
# dy/dx = 1 + 1 = 2
```

Both paths contribute to the gradient.
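A minimal sketch of what the accumulation loop does in this case (plain Python, illustrative names):

```python
# y = x + x: addition sends a gradient of 1 to each operand,
# and both operands are the same tensor, so x is credited twice.
x_grad = None

for upstream in [1.0, 1.0]:   # one contribution per use of x
    if x_grad is None:
        x_grad = 0.0
    x_grad += upstream

print(x_grad)   # 2.0 == dy/dx
```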
6.10 Summary
Backpropagation:
- Topological sort — Order nodes from output to input
- Initialize — Set output gradient to 1
- Chain rule — Multiply incoming gradient by local gradient
- Accumulate — Add gradients when tensor is reused
No magic — just the chain rule, automated.
Next: Using gradients to actually train the model!