9  SGD with Momentum

Vanilla SGD is slow in flat regions and oscillates in steep, narrow ones. Momentum addresses both problems.

9.1 The Problem

Imagine a ball rolling down a valley:

  • Vanilla SGD: Ball moves only where the slope points, stops at every bump
  • Momentum: Ball builds up speed, rolls past small bumps

flowchart LR
    subgraph Vanilla SGD
        A1[slow] --> A2[stuck] --> A3[stuck]
    end
    subgraph With Momentum
        B1[slow] --> B2[faster] --> B3[through!]
    end

9.2 The Physics Analogy

Momentum mimics physical momentum:

  • Velocity accumulates over time
  • Heavy ball keeps moving even when gradient is small
  • Damping (friction) prevents runaway velocity

9.3 The Math

Vanilla SGD: \[\theta_{t+1} = \theta_t - \eta \cdot g_t\]

SGD with Momentum: \[v_{t+1} = \mu \cdot v_t + g_t\] \[\theta_{t+1} = \theta_t - \eta \cdot v_{t+1}\]

Where:

  • \(\theta\) = parameters
  • \(g\) = gradient
  • \(v\) = velocity
  • \(\eta\) = learning rate
  • \(\mu\) = momentum coefficient (typically 0.9)
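
To see what the two rules do in practice, here is a tiny hand computation in plain NumPy, a minimal sketch with made-up numbers (η = 0.1, μ = 0.9, and a gradient that stays at 1.0):

import numpy as np

eta, mu = 0.1, 0.9       # learning rate and momentum coefficient (illustrative values)
theta = np.array([0.0])  # parameter
v = np.array([0.0])      # velocity
g = np.array([1.0])      # pretend the gradient is constant

for t in range(3):
    # Vanilla SGD would take the same step every time: -eta * g = -0.1
    v = mu * v + g               # velocity grows: 1.0, 1.9, 2.71, ...
    theta = theta - eta * v      # steps grow too: -0.10, -0.19, -0.271, ...
    print(t, v, theta)

As long as the gradient keeps pointing the same way, the step size keeps growing instead of staying fixed at \(\eta \cdot g\).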

9.4 Implementation

import numpy as np

class SGD(Optimizer):
    """SGD with optional momentum."""

    def __init__(self, parameters, lr=0.01, momentum=0.0):
        super().__init__(parameters, lr)
        self.momentum = momentum

        # Initialize velocity for each parameter
        self.velocities = []
        for param in self.parameters:
            self.velocities.append(np.zeros_like(param.data))

    def step(self):
        for param, velocity in zip(self.parameters, self.velocities):
            if param.grad is None:
                continue

            if self.momentum > 0:
                # Update velocity: v = momentum * v + grad
                velocity *= self.momentum
                velocity += param.grad

                # Update parameter: p = p - lr * v
                param.data -= self.lr * velocity
            else:
                # Vanilla SGD
                param.data -= self.lr * param.grad

Note

Code Reference: See src/tensorweaver/optimizers/sgd.py for the implementation.

9.5 How Momentum Helps

9.5.1 Case 1: Flat Region

Without momentum:

grad ≈ 0 → step ≈ 0 → stuck!

With momentum:

velocity still has past gradients → keeps moving!
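
As a minimal sketch with made-up numbers (μ = 0.9, learning rate 0.1, and a velocity of 2.0 built up before the plateau):

import numpy as np

mu, lr = 0.9, 0.1
v = np.array([2.0])      # velocity carried into the flat region (made-up value)
g = np.array([0.0])      # the gradient is essentially zero here

for t in range(3):
    v = mu * v + g       # velocity only decays by the factor mu: 1.8, 1.62, 1.458
    print(t, -lr * v)    # the parameter still takes a nonzero step each iteration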

9.5.2 Case 2: Oscillating Gradients

Without momentum:

grad = +10, -10, +10, -10 → zigzag forever

With momentum:

Oscillating components largely cancel out in the velocity
Consistent gradient directions keep accumulating, so progress speeds up
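
A short simulation makes this concrete. As a minimal sketch with made-up numbers, take a 2-D gradient whose first component oscillates (+10 / −10) while the second is small but consistent (+1 every step):

import numpy as np

mu = 0.9
# Made-up gradients: first component oscillates, second is consistently +1
grads = [np.array([10.0, 1.0]), np.array([-10.0, 1.0])] * 25

v = np.zeros(2)
for g in grads:
    v = mu * v + g

print(v)
# The oscillating component stays bounded (it ends up around -5 and never exceeds 10),
# while the consistent component has accumulated to nearly 1 / (1 - 0.9) = 10.

So the zigzag direction is damped while the consistent direction is amplified, which is exactly the behaviour described above.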

9.6 Visualizing Momentum

# Temperature training: comparing vanilla vs momentum

import matplotlib.pyplot as plt

from tensorweaver import Tensor
from tensorweaver.optim import SGD

# Celsius -> Fahrenheit data (same as the temperature model below)
celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])

def train_with_sgd(momentum, epochs=500):
    w = Tensor([[1.0]], requires_grad=True)
    b = Tensor([0.0], requires_grad=True)

    optimizer = SGD([w, b], lr=0.0001, momentum=momentum)
    losses = []

    for epoch in range(epochs):
        pred = celsius @ w.T + b
        loss = ((pred - fahrenheit) ** 2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        losses.append(float(loss.data))

    return losses

losses_vanilla = train_with_sgd(momentum=0.0)
losses_momentum = train_with_sgd(momentum=0.9)

# Momentum converges faster!
plt.plot(losses_vanilla, label="momentum=0.0")
plt.plot(losses_momentum, label="momentum=0.9")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()

9.7 Choosing Momentum

Momentum    Effect
0.0         Vanilla SGD
0.5         Mild smoothing
0.9         Standard choice
0.99        Very smooth, may overshoot

Rule of thumb: Start with 0.9
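
One way to read this table: if the gradient stays roughly constant at \(g\), the velocity update \(v \leftarrow \mu \cdot v + g\) settles at \[v_\infty = \frac{g}{1 - \mu}\] so \(\mu = 0.9\) multiplies the effective step size by about 10, and \(\mu = 0.99\) by about 100, in directions where gradients agree. That is why very large momentum values can overshoot and are often paired with a smaller learning rate.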

9.8 Temperature Model with Momentum

from tensorweaver import Tensor
from tensorweaver.optim import SGD

# Data
celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])

# Parameters
w = Tensor([[1.0]], requires_grad=True)
b = Tensor([0.0], requires_grad=True)

# SGD with momentum
optimizer = SGD([w, b], lr=0.0001, momentum=0.9)

for epoch in range(500):
    pred = celsius @ w.T + b
    loss = ((pred - fahrenheit) ** 2).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if epoch % 100 == 0:
        print(f"Epoch {epoch}: loss={loss.data:.4f}")

print(f"Final: w={w.data.item():.3f}, b={b.data.item():.3f}")

With momentum=0.9, we typically need fewer epochs to converge!

9.9 Nesterov Momentum (Advanced)

A variant that “looks ahead”:

\[v_{t+1} = \mu \cdot v_t + \nabla f(\theta_t - \eta \mu v_t)\] \[\theta_{t+1} = \theta_t - \eta \cdot v_{t+1}\]

The gradient is computed at the “lookahead” position. Often converges faster.

class SGD(Optimizer):
    def __init__(self, parameters, lr=0.01, momentum=0.0, nesterov=False):
        # ...
        self.nesterov = nesterov

    def step(self):
        for param, velocity in zip(self.parameters, self.velocities):
            if param.grad is None:
                continue

            if self.momentum > 0:
                velocity *= self.momentum
                velocity += param.grad

                if self.nesterov:
                    # Nesterov: use momentum-adjusted gradient
                    param.data -= self.lr * (param.grad + self.momentum * velocity)
                else:
                    param.data -= self.lr * velocity
            else:
                param.data -= self.lr * param.grad
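
With the constructor extension above, enabling the variant is a one-line change (a sketch reusing the temperature-model parameters):

optimizer = SGD([w, b], lr=0.0001, momentum=0.9, nesterov=True)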

9.10 Summary

  • Momentum accumulates gradients over time
  • Helps escape flat regions and reduces oscillation
  • Standard value: 0.9
  • Nesterov variant looks ahead for better convergence

SGD with momentum is still widely used, but Adam is often more convenient.

Next: The Adam optimizer.