flowchart LR
subgraph Vanilla SGD
A1[slow] --> A2[stuck] --> A3[stuck]
end
subgraph With Momentum
B1[slow] --> B2[faster] --> B3[through!]
end
9 SGD with Momentum
Vanilla SGD is slow in flat regions and oscillates in steep ones. Momentum fixes this.
9.1 The Problem
Imagine a ball rolling down a valley:
- Vanilla SGD: Ball moves only where the slope points, stops at every bump
- Momentum: Ball builds up speed, rolls past small bumps
9.2 The Physics Analogy
Momentum mimics physical momentum:
- Velocity accumulates over time
- Heavy ball keeps moving even when gradient is small
- Damping (friction) prevents runaway velocity
9.3 The Math
Vanilla SGD: \[\theta_{t+1} = \theta_t - \eta \cdot g_t\]
SGD with Momentum: \[v_{t+1} = \mu \cdot v_t + g_t\] \[\theta_{t+1} = \theta_t - \eta \cdot v_{t+1}\]
Where:
- \(\theta\) = parameters
- \(g\) = gradient
- \(v\) = velocity
- \(\eta\) = learning rate
- \(\mu\) = momentum coefficient (typically 0.9)
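To make the difference concrete, here is a tiny hand computation of both update rules in plain Python (illustrative numbers only: constant gradient g = 1.0, η = 0.1, μ = 0.9). Vanilla SGD takes the same 0.1 step every time, while the momentum step grows as velocity accumulates.
eta, mu = 0.1, 0.9
theta_sgd, theta_mom, v = 0.0, 0.0, 0.0
for t in range(3):
    g = 1.0                  # pretend the gradient stays constant
    theta_sgd -= eta * g     # vanilla SGD: fixed step of 0.1
    v = mu * v + g           # velocity: 1.0, 1.9, 2.71
    theta_mom -= eta * v     # momentum step grows: 0.1, 0.19, 0.271
print(theta_sgd, theta_mom)  # approx. -0.3 vs -0.561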
9.4 Implementation
class SGD(Optimizer):
    """SGD with optional momentum."""

    def __init__(self, parameters, lr=0.01, momentum=0.0):
        super().__init__(parameters, lr)
        self.momentum = momentum
        # Initialize velocity for each parameter
        self.velocities = []
        for param in self.parameters:
            self.velocities.append(np.zeros_like(param.data))

    def step(self):
        for param, velocity in zip(self.parameters, self.velocities):
            if param.grad is None:
                continue
            if self.momentum > 0:
                # Update velocity in place: v = momentum * v + grad
                velocity *= self.momentum
                velocity += param.grad
                # Update parameter: p = p - lr * v
                param.data -= self.lr * velocity
            else:
                # Vanilla SGD
                param.data -= self.lr * param.grad

Code Reference: See src/tensorweaver/optimizers/sgd.py for the implementation.
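A quick sanity check of the class (a hypothetical snippet: it assigns a NumPy array to param.grad directly, which is the form step() reads above, instead of calling backward()). With a constant gradient the effective step grows exactly as in the hand computation from 9.3:
import numpy as np
from tensorweaver import Tensor

w = Tensor([[0.0]], requires_grad=True)
opt = SGD([w], lr=0.1, momentum=0.9)
for _ in range(3):
    w.grad = np.ones_like(w.data)   # pretend backward() produced grad = 1.0
    opt.step()
    print(w.data)                   # -0.1, -0.29, -0.561: steps of 0.1, 0.19, 0.271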
9.5 How Momentum Helps
9.5.1 Case 1: Flat Region
Without momentum:
grad ≈ 0 → step ≈ 0 → stuck!
With momentum:
velocity still has past gradients → keeps moving!
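A small numeric sketch of this case (illustrative numbers: μ = 0.9, η = 0.1, and a velocity of 2.0 built up before entering the flat region):
mu, eta = 0.9, 0.1
v = 2.0              # velocity accumulated before the flat region
theta = 0.0
for _ in range(3):
    g = 0.0          # flat region: gradient is (almost) zero
    v = mu * v + g   # velocity only decays: 1.8, 1.62, 1.458
    theta -= eta * v
    print(theta)     # -0.18, -0.342, -0.4878: still moving; vanilla SGD would not move at all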
9.5.2 Case 2: Oscillating Gradients
Without momentum:
grad = +10, -10, +10, -10 → zigzag forever
With momentum:
- Perpendicular components (the ones that flip sign each step) largely cancel out
- The consistent direction accelerates (see the sketch below)
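A small two-dimensional sketch of this effect (illustrative numbers: a consistent x-gradient of +1 along the valley and a y-gradient that flips between +10 and -10 across it):
mu = 0.9
vx = vy = 0.0
for t in range(200):
    gx = 1.0                            # consistent component (along the valley)
    gy = 10.0 if t % 2 == 0 else -10.0  # oscillating component (across the valley)
    vx = mu * vx + gx
    vy = mu * vy + gy
print(round(vx, 2), round(vy, 2))       # 10.0 -5.26
# Vanilla SGD steps 1 along x and 10 across y every iteration (ratio 1:10).
# With momentum the velocities settle near 1/(1-mu) = 10 along x and 10/(1+mu) = 5.26
# across y (ratio about 2:1), so progress shifts toward the consistent direction.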
9.6 Visualizing Momentum
# Temperature training: comparing vanilla SGD vs. SGD with momentum
# (celsius / fahrenheit and the Tensor / SGD imports are the same as in 9.8 below)
import matplotlib.pyplot as plt

def train_with_sgd(momentum, epochs=500):
    w = Tensor([[1.0]], requires_grad=True)
    b = Tensor([0.0], requires_grad=True)
    optimizer = SGD([w, b], lr=0.0001, momentum=momentum)
    losses = []
    for epoch in range(epochs):
        pred = celsius @ w.T + b
        loss = ((pred - fahrenheit) ** 2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        losses.append(loss.data)
    return losses

losses_vanilla = train_with_sgd(momentum=0.0)
losses_momentum = train_with_sgd(momentum=0.9)

# Plot both loss curves: momentum converges faster!
plt.plot(losses_vanilla, label="momentum=0.0")
plt.plot(losses_momentum, label="momentum=0.9")
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.legend()
plt.show()
9.7 Choosing Momentum
| Momentum | Effect |
|---|---|
| 0.0 | Vanilla SGD |
| 0.5 | Mild smoothing |
| 0.9 | Standard choice |
| 0.99 | Very smooth, may overshoot |
Rule of thumb: Start with 0.9
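To compare these settings on the temperature data, one option is to reuse the train_with_sgd helper from 9.6 (a sketch; the exact losses depend on the data and learning rate, and 0.99 may blow up here, matching "may overshoot" in the table):
for mu in [0.0, 0.5, 0.9, 0.99]:
    losses = train_with_sgd(momentum=mu)
    print(f"momentum={mu}: final loss = {losses[-1].item():.4f}")   # .item() assumes loss.data is a NumPy array, as in 9.8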
9.8 Temperature Model with Momentum
from tensorweaver import Tensor
from tensorweaver.optim import SGD

# Data
celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])

# Parameters
w = Tensor([[1.0]], requires_grad=True)
b = Tensor([0.0], requires_grad=True)

# SGD with momentum
optimizer = SGD([w, b], lr=0.0001, momentum=0.9)

for epoch in range(500):
    pred = celsius @ w.T + b
    loss = ((pred - fahrenheit) ** 2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: loss={loss.data:.4f}")

print(f"Final: w={w.data.item():.3f}, b={b.data.item():.3f}")

With momentum=0.9, we typically need fewer epochs to converge!
9.9 Nesterov Momentum (Advanced)
A variant that “looks ahead”:
\[v_{t+1} = \mu \cdot v_t + \nabla f(\theta_t - \eta \mu v_t)\] \[\theta_{t+1} = \theta_t - \eta \cdot v_{t+1}\]
The gradient is computed at the “lookahead” position, i.e. where the momentum step alone would take us. It often converges slightly faster than standard momentum.
class SGD(Optimizer):
    def __init__(self, parameters, lr=0.01, momentum=0.0, nesterov=False):
        # ...
        self.nesterov = nesterov

    def step(self):
        for param, velocity in zip(self.parameters, self.velocities):
            if param.grad is None:
                continue
            if self.momentum > 0:
                velocity *= self.momentum
                velocity += param.grad
                if self.nesterov:
                    # Nesterov: use momentum-adjusted gradient
                    param.data -= self.lr * (param.grad + self.momentum * velocity)
                else:
                    param.data -= self.lr * velocity
            else:
                param.data -= self.lr * param.grad
9.10 Summary
- Momentum accumulates gradients over time
- Helps escape flat regions and reduces oscillation
- Standard value: 0.9
- Nesterov variant looks ahead for better convergence
SGD with momentum is still widely used, but Adam is often more convenient.
Next: The Adam optimizer.