6  Backpropagation

This is the heart of deep learning. Let’s demystify it.

6.1 The Chain Rule

Backpropagation is just the chain rule from calculus, applied systematically.

If \(z = f(y)\) and \(y = g(x)\), then:

\[\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}\]

That’s it. No magic.

6.2 A Simple Example

Consider:

x = 2.0
y = x ** 2      # y = 4
z = y * 3       # z = 12

We want \(\frac{dz}{dx}\).

By chain rule:

\[\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = 3 \cdot 2x = 3 \cdot 4 = 12\]

Backprop computes this backwards:

  1. Start: \(\frac{dz}{dz} = 1\)
  2. Through *3: \(\frac{dz}{dy} = 3\)
  3. Through **2: \(\frac{dz}{dx} = 3 \cdot 2x = 12\)
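
A quick numerical sanity check, using only plain Python (no Tensor class needed), confirms the same value with a central difference:

def z_of(x):
    y = x ** 2
    return y * 3

eps = 1e-6
numeric = (z_of(2.0 + eps) - z_of(2.0 - eps)) / (2 * eps)
print(numeric)  # ~12.0, matching the chain-rule result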

6.3 Gradients for Each Operation

Each operation needs a backward function:

6.3.1 Addition: \(z = a + b\)

\[\frac{\partial z}{\partial a} = 1, \quad \frac{\partial z}{\partial b} = 1\]

The gradient just passes through unchanged.

6.3.2 Multiplication: \(z = a \times b\)

\[\frac{\partial z}{\partial a} = b, \quad \frac{\partial z}{\partial b} = a\]

The gradient is scaled by the other input.

6.3.3 Matrix Multiplication: \(Z = A @ B\)

\[\frac{\partial L}{\partial A} = \frac{\partial L}{\partial Z} @ B^T\]

\[\frac{\partial L}{\partial B} = A^T @ \frac{\partial L}{\partial Z}\]
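
These two rules are the easiest to get wrong, so it is worth verifying one numerically. A small NumPy sketch (independent of the Tensor class) that checks \(\frac{\partial L}{\partial A}\) against a central difference, using \(L = \text{sum}(A @ B)\) so that \(\frac{\partial L}{\partial Z}\) is all ones:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
grad_Z = np.ones((2, 4))          # dL/dZ for L = sum(Z)

grad_A = grad_Z @ B.T             # analytic rule: dL/dA = dL/dZ @ B^T

eps = 1e-6
numeric = np.zeros_like(A)
for idx in np.ndindex(A.shape):
    A_plus, A_minus = A.copy(), A.copy()
    A_plus[idx] += eps
    A_minus[idx] -= eps
    numeric[idx] = ((A_plus @ B).sum() - (A_minus @ B).sum()) / (2 * eps)

print(np.abs(grad_A - numeric).max())  # tiny, on the order of 1e-9 or less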

6.3.4 Power: \(z = x^n\)

\[\frac{\partial z}{\partial x} = n \cdot x^{n-1}\]

6.3.5 Mean: \(z = \text{mean}(x)\)

\[\frac{\partial z}{\partial x_i} = \frac{1}{n}\]

where \(n\) is the number of elements in \(x\).

6.4 Implementing Backward

Let’s add a backward() method to our Tensor class:

class Tensor:
    def backward(self):
        """Compute gradients via backpropagation."""

        # Step 1: Topological sort (process nodes in correct order)
        order = self._topological_sort()

        # Step 2: Initialize gradient of output
        self.grad = np.ones_like(self.data)

        # Step 3: Propagate gradients backwards
        for tensor in order:
            if tensor.grad_fn is None:
                continue  # Leaf tensor

            # Get gradient flowing into this tensor
            grad_output = tensor.grad

            # Compute gradients for each parent
            grads = self._compute_gradients(tensor, grad_output)

            # Accumulate gradients in parents
            for parent, grad in zip(tensor.parents, grads):
                if parent.requires_grad:
                    if parent.grad is None:
                        parent.grad = np.zeros_like(parent.data)
                    parent.grad += grad
Note

Code Reference: See src/tensorweaver/autodiff/tensor.py for the full backward implementation.
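
With backward() in place, the simple example from 6.2 can be run end to end. The constructor call below is a sketch that assumes the Tensor class from earlier chapters accepts a value plus requires_grad, as in the examples later in this chapter:

x = Tensor(2.0, requires_grad=True)
z = (x ** 2) * 3
z.backward()
print(x.grad)  # 12.0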

6.5 Topological Sort

We must process nodes in the right order — from output to input:

def _topological_sort(self):
    """Return tensors in reverse order for backprop."""
    visited = set()
    order = []

    def dfs(tensor):
        if id(tensor) in visited:
            return
        visited.add(id(tensor))
        for parent in tensor.parents:
            dfs(parent)
        order.append(tensor)

    dfs(self)
    return order[::-1]  # Reverse for backprop
Note

Code Reference: See src/tensorweaver/autodiff/topological_sort.py for the implementation.
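
To see why the reversal puts the output first, here is the same depth-first traversal on a tiny standalone graph (a minimal sketch with a throwaway Node class, no Tensor machinery required):

class Node:
    def __init__(self, name, parents=()):
        self.name, self.parents = name, parents

x = Node("x")
y = Node("y", parents=(x,))
z = Node("z", parents=(y,))

visited, order = set(), []

def dfs(node):
    if id(node) in visited:
        return
    visited.add(id(node))
    for parent in node.parents:
        dfs(parent)
    order.append(node)

dfs(z)
print([n.name for n in order[::-1]])  # ['z', 'y', 'x'] -- output first, inputs last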

6.6 Computing Gradients by Operation

def _compute_gradients(self, tensor, grad_output):
    """Compute gradients based on the operation."""

    if tensor.grad_fn == 'add':
        # d(a+b)/da = 1, d(a+b)/db = 1
        return [grad_output, grad_output]

    elif tensor.grad_fn == 'matmul':
        a, b = tensor.parents
        # d(a@b)/da = grad @ b.T
        # d(a@b)/db = a.T @ grad
        grad_a = grad_output @ b.data.T
        grad_b = a.data.T @ grad_output
        return [grad_a, grad_b]

    elif tensor.grad_fn == 'pow':
        parent = tensor.parents[0]
        power = tensor._power  # Stored during forward
        # d(x^n)/dx = n * x^(n-1)
        grad = power * (parent.data ** (power - 1)) * grad_output
        return [grad]

    elif tensor.grad_fn == 'mean':
        parent = tensor.parents[0]
        # d(mean(x))/dx = 1/n
        grad = np.ones_like(parent.data) * grad_output / parent.data.size
        return [grad]

    # ... more operations ...
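
The multiplication rule from 6.3.2 and the subtraction used later in 6.7 follow the same pattern. A sketch of those two branches, assuming the forward pass tags them 'mul' and 'sub':

    elif tensor.grad_fn == 'mul':
        a, b = tensor.parents
        # d(a*b)/da = b, d(a*b)/db = a
        return [grad_output * b.data, grad_output * a.data]

    elif tensor.grad_fn == 'sub':
        # d(a-b)/da = 1, d(a-b)/db = -1
        return [grad_output, -grad_output]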

6.7 Temperature Example: Full Backward Pass

Let’s trace through:

# Setup
celsius = Tensor([[0.0], [100.0]])
targets = Tensor([[32.0], [212.0]])
w = Tensor([[0.5]], requires_grad=True)
b = Tensor([10.0], requires_grad=True)

# Forward
pred = celsius @ w.T + b        # [[10], [60]]
diff = pred - targets           # [[-22], [-152]]
squared = diff ** 2             # [[484], [23104]]
loss = squared.mean()           # 11794.0

# Backward
loss.backward()

Step by step:

1. loss.grad = 1.0

2. Through mean():
   squared.grad = 1.0 / 2 = 0.5  (for each element)

3. Through ** 2:
   diff.grad = 0.5 * 2 * diff = [[-22], [-152]]

4. Through subtraction:
   pred.grad = [[-22], [-152]]

5. Through addition (b was broadcast to both rows, so its gradient is summed over them):
   matmul_out.grad = [[-22], [-152]]
   b.grad = sum([[-22], [-152]]) = -174

6. Through matmul:
   w.grad = celsius.T @ matmul_out.grad
         = [[0, 100]] @ [[-22], [-152]]
         = [[-15200]]

Final gradients:

  • w.grad = -15200 (w should increase)
  • b.grad = -174 (b should increase)
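
The same numbers can be reproduced with a few lines of plain NumPy, following the trace above (this bypasses the Tensor class entirely):

import numpy as np

celsius = np.array([[0.0], [100.0]])
targets = np.array([[32.0], [212.0]])
w = np.array([[0.5]])
b = np.array([10.0])

pred = celsius @ w.T + b              # [[10.], [60.]]
diff = pred - targets                 # [[-22.], [-152.]]
grad_pred = 2 * diff / diff.size      # through mean, **2, and subtraction
grad_w = (celsius.T @ grad_pred).T    # [[-15200.]]
grad_b = grad_pred.sum()              # -174.0
print(grad_w, grad_b)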

6.8 Gradient Checking

Verify gradients numerically:

def gradient_check(param, loss_fn, eps=1e-5):
    """Check gradient using finite differences."""
    param.grad = None
    loss = loss_fn()
    loss.backward()
    analytic = param.grad.copy()

    numeric = np.zeros_like(param.data)
    for idx in np.ndindex(param.shape):
        param.data[idx] += eps
        loss_plus = loss_fn().data

        param.data[idx] -= 2 * eps
        loss_minus = loss_fn().data

        param.data[idx] += eps  # Restore

        numeric[idx] = (loss_plus - loss_minus) / (2 * eps)

    diff = np.abs(analytic - numeric).max()
    print(f"Max difference: {diff}")
    return diff < 1e-4
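
As a usage sketch (assuming the Tensor class supports the operations from 6.7), checking w could look like this; the closure rebuilds the loss from scratch on every call so the perturbed parameter is actually used:

def temperature_loss():
    pred = celsius @ w.T + b
    return ((pred - targets) ** 2).mean()

gradient_check(w, temperature_loss)  # should report a tiny max difference and return True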

6.9 Why Accumulate Gradients?

We use += instead of =:

parent.grad += grad

Because a tensor might be used multiple times:

y = x + x  # x used twice!
# dy/dx = 1 + 1 = 2

Both paths contribute to the gradient.
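
A minimal sketch of what this looks like in practice (again assuming the Tensor class):

x = Tensor(3.0, requires_grad=True)
y = x + x
y.backward()
print(x.grad)  # 2.0 -- one contribution of 1 from each use of x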

6.10 Summary

Backpropagation:

  1. Topological sort — Order nodes from output to input
  2. Initialize — Set output gradient to 1
  3. Chain rule — Multiply incoming gradient by local gradient
  4. Accumulate — Add gradients when tensor is reused

No magic — just the chain rule, automated.

Next: Using gradients to actually train the model!