10 Adam Optimizer
Adam is the “default” optimizer for deep learning. It combines momentum with adaptive learning rates.
10.1 Why Adam?
| Optimizer | Learning Rate | Works Out of Box |
|---|---|---|
| Vanilla SGD | Needs careful tuning | Sometimes |
| SGD + Momentum | Needs careful tuning | Usually |
| Adam | 0.001 usually works | Almost always |
Adam adapts the learning rate per parameter based on gradient history.
10.2 The Intuition
Adam tracks two things for each parameter:
- First moment (m): Moving average of gradients (like momentum)
- Second moment (v): Moving average of squared gradients
Parameters whose gradients have been large get a smaller effective learning rate (the update is divided by \(\sqrt{v}\)); parameters whose gradients have been small get a larger one.
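To make the two moments concrete, here is a small sketch for a single parameter; the gradient values are made up purely for illustration:

```python
beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0

# Hypothetical gradient sequence for one parameter (values made up for illustration)
for g in [4.0, -3.5, 4.2, 3.8]:
    m = beta1 * m + (1 - beta1) * g       # first moment: smoothed gradient (momentum-like)
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment: smoothed squared gradient
    print(f"g={g:+.1f}  m={m:+.3f}  v={v:.4f}")
```

The update in the next section divides by \(\sqrt{v}\) (after bias correction), so a parameter that keeps seeing large gradients automatically takes smaller steps.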
10.3 The Math
With \(g_t\) denoting the gradient at step \(t\): \[m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t\] \[v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2\]
Bias correction (important for early steps): \[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}\] \[\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]
Update: \[\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]
Default hyperparameters:
- \(\beta_1 = 0.9\) (momentum decay)
- \(\beta_2 = 0.999\) (RMSprop decay)
- \(\epsilon = 10^{-8}\) (numerical stability)
- \(\eta = 0.001\) (learning rate)
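As a quick worked instance of these formulas at \(t = 1\) with the defaults above (starting from \(m_0 = v_0 = 0\)): \[m_1 = 0.1\, g_1, \quad \hat{m}_1 = \frac{0.1\, g_1}{1 - 0.9} = g_1, \qquad v_1 = 0.001\, g_1^2, \quad \hat{v}_1 = \frac{0.001\, g_1^2}{1 - 0.999} = g_1^2\] so the first update is \(\theta_1 = \theta_0 - \eta \cdot \frac{g_1}{|g_1| + \epsilon} \approx \theta_0 - \eta \cdot \mathrm{sign}(g_1)\): whatever the gradient's magnitude, the first step has size roughly \(\eta\).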
10.4 Implementation
```python
class Adam(Optimizer):
    """Adam optimizer."""

    def __init__(self, parameters, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(parameters, lr)
        self.betas = betas
        self.eps = eps
        self.t = 0  # Timestep
        # Initialize moment estimates
        self.m = [np.zeros_like(p.data) for p in self.parameters]
        self.v = [np.zeros_like(p.data) for p in self.parameters]

    def step(self):
        self.t += 1
        beta1, beta2 = self.betas
        for i, param in enumerate(self.parameters):
            if param.grad is None:
                continue

            g = param.grad

            # Update biased first moment estimate
            self.m[i] = beta1 * self.m[i] + (1 - beta1) * g
            # Update biased second moment estimate
            self.v[i] = beta2 * self.v[i] + (1 - beta2) * (g ** 2)

            # Bias correction
            m_hat = self.m[i] / (1 - beta1 ** self.t)
            v_hat = self.v[i] / (1 - beta2 ** self.t)

            # Update parameters
            param.data -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

Code Reference: See `src/tensorweaver/optimizers/adam.py` for the implementation.
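As a quick sanity check, the class can be pointed at a toy quadratic. The `FakeParam` class below is a minimal stand-in (just `.data` and `.grad` arrays), and the sketch assumes the `Optimizer` base class from earlier simply stores `parameters` and `lr`, as the `super().__init__(parameters, lr)` call suggests:

```python
import numpy as np

class FakeParam:
    """Minimal stand-in for a parameter: just .data and .grad."""
    def __init__(self, value):
        self.data = np.array(value, dtype=float)
        self.grad = None

x = FakeParam([5.0])            # start far from the minimum of f(x) = x^2
opt = Adam([x], lr=0.1)

for _ in range(300):
    x.grad = 2 * x.data         # analytic gradient of f(x) = x^2
    opt.step()

print(x.data)                   # should end up close to 0
```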
10.5 Why Bias Correction?
Because \(m_0 = v_0 = 0\), both moving averages are biased toward zero for the first few steps. At \(t = 1\) with \(\beta_2 = 0.999\):

Without correction: \[v_1 = 0.001 \cdot g_1^2 \quad \text{(way too small)}\]

With correction: \[\hat{v}_1 = \frac{v_1}{1 - 0.999^1} = \frac{v_1}{0.001} = g_1^2 \quad \text{(correct scale)}\]
Bias correction fixes the initialization problem.
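The bias also decays slowly on its own. A small sketch (illustrative only) that feeds a constant gradient of 1 shows how long the raw average takes to reach the true scale, while the corrected estimate is right from step one:

```python
beta2 = 0.999
v = 0.0
for t in range(1, 5001):
    v = beta2 * v + (1 - beta2) * 1.0 ** 2     # constant gradient g = 1
    if t in (1, 10, 100, 1000, 5000):
        v_hat = v / (1 - beta2 ** t)           # bias-corrected estimate
        print(f"t={t:5d}  v={v:.4f}  v_hat={v_hat:.4f}")
```

With a constant gradient the true second moment is 1; the raw `v` needs thousands of steps to get there, but `v_hat` equals 1 from the very first step.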
10.6 Temperature Model with Adam
```python
from tensorweaver import Tensor
from tensorweaver.optim import Adam

# Data
celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])

# Parameters
w = Tensor([[1.0]], requires_grad=True)
b = Tensor([0.0], requires_grad=True)

# Adam optimizer
optimizer = Adam([w, b], lr=0.1)  # Higher lr works with Adam!

for epoch in range(200):  # Fewer epochs needed!
    pred = celsius @ w.T + b
    loss = ((pred - fahrenheit) ** 2).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if epoch % 40 == 0:
        print(f"Epoch {epoch}: loss={loss.data:.4f}")

print(f"Final: w={w.data.item():.3f}, b={b.data.item():.3f}")
```

Adam typically converges in fewer iterations than SGD!
10.7 AdamW: Adam with Weight Decay
Standard Adam has a subtle problem with weight decay (L2 regularization): if the decay is implemented by adding `weight_decay * param` to the gradient, it gets divided by \(\sqrt{\hat{v}}\) along with everything else, so the parameters with the largest gradients receive the least regularization. AdamW fixes this by decoupling the decay from the adaptive update:
```python
class AdamW(Adam):
    """Adam with decoupled weight decay."""

    def __init__(self, parameters, lr=0.001, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.01):
        super().__init__(parameters, lr, betas, eps)
        self.weight_decay = weight_decay

    def step(self):
        self.t += 1
        beta1, beta2 = self.betas
        for i, param in enumerate(self.parameters):
            if param.grad is None:
                continue

            # Decoupled weight decay (before Adam update)
            param.data -= self.lr * self.weight_decay * param.data

            g = param.grad

            # Standard Adam update
            self.m[i] = beta1 * self.m[i] + (1 - beta1) * g
            self.v[i] = beta2 * self.v[i] + (1 - beta2) * (g ** 2)

            m_hat = self.m[i] / (1 - beta1 ** self.t)
            v_hat = self.v[i] / (1 - beta2 ** self.t)

            param.data -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

AdamW is the standard for training Transformers.
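Here is a simplified sketch of why the coupling matters. It isolates only the weight-decay term's contribution to a single update (ignoring its small interaction with the momentum estimate), and the numbers are arbitrary:

```python
import numpy as np

lr, wd = 0.001, 0.01
theta = np.array([2.0])
v_hat = np.array([10000.0])   # large second moment: this parameter sees big gradients

# Coupled L2: the decay term wd * theta rides inside the gradient,
# so it is also divided by sqrt(v_hat) -> almost no regularization here
coupled = lr * (wd * theta) / (np.sqrt(v_hat) + 1e-8)

# Decoupled (AdamW): the decay is applied outside the adaptive scaling,
# so it always removes the same fraction of the weight
decoupled = lr * wd * theta

print(coupled, decoupled)     # ~2e-07 vs 2e-05
```

In this example the coupled version decays the heavily-updated parameter about 100× less than intended, which is exactly the behaviour AdamW removes.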
10.8 Comparing Optimizers
```python
def compare_optimizers():
    results = {}

    for name, OptClass, kwargs in [
        ('SGD', SGD, {'lr': 0.0001}),
        ('SGD+Momentum', SGD, {'lr': 0.0001, 'momentum': 0.9}),
        ('Adam', Adam, {'lr': 0.1}),
    ]:
        w = Tensor([[1.0]], requires_grad=True)
        b = Tensor([0.0], requires_grad=True)
        opt = OptClass([w, b], **kwargs)

        losses = []
        for _ in range(300):
            pred = celsius @ w.T + b
            loss = ((pred - fahrenheit) ** 2).mean()
            loss.backward()
            opt.step()
            opt.zero_grad()
            losses.append(loss.data)

        results[name] = losses

    return results
```

Typical result: Adam reaches low loss fastest.
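A minimal way to use it (exact numbers will vary from run to run):

```python
results = compare_optimizers()
for name, losses in results.items():
    print(f"{name}: final loss = {float(losses[-1]):.4f}")
```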
10.9 When to Use What
| Situation | Recommendation |
|---|---|
| Default choice | Adam (lr=0.001) |
| Transformers | AdamW (lr=1e-4 to 3e-4) |
| Fine-tuning | Lower lr (1e-5 to 1e-4) |
| Convex problems | SGD can work well |
| Memory constrained | SGD (no moment storage) |
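The memory row is easy to quantify: Adam stores two extra arrays (`m` and `v`) the same shape as the parameters. For an illustrative 7-billion-parameter model in 32-bit floats (an arbitrary example size):

```python
n_params = 7e9                                  # example model size (arbitrary)
bytes_per_float = 4                             # fp32
adam_state = 2 * n_params * bytes_per_float     # m and v
print(f"{adam_state / 1e9:.0f} GB of optimizer state")   # 56 GB
```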
10.10 Summary
- Adam combines momentum + adaptive learning rates
- Default: `Adam(params, lr=0.001)`
- AdamW adds proper weight decay
- Usually works “out of the box”
Next: Learning rate schedules for even better training.