# 11 Learning Rate Schedules

```mermaid
flowchart LR
    A[Start: lr=0.1] --> B[Middle: lr=0.01] --> C[End: lr=0.001]
```
A constant learning rate is rarely optimal for a whole run. Schedules adapt it as training progresses.
## 11.1 The Intuition

- Early training: large steps to make quick progress
- Late training: small steps for fine-tuning
## 11.2 Scheduler Base Class
```python
class LRScheduler:
    """Base class for learning rate schedulers."""

    def __init__(self, optimizer):
        self.optimizer = optimizer
        self.base_lr = optimizer.lr
        self.step_count = 0

    def step(self):
        """Update learning rate (call after each epoch or step)."""
        raise NotImplementedError

    def get_lr(self):
        """Get current learning rate."""
        return self.optimizer.lr
```
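To make the subclass contract concrete, here is a minimal sketch. The `HalveEvery10` class and the `SimpleNamespace` stand-in optimizer are illustrative only (not part of tensorweaver); any object exposing an `lr` attribute satisfies what `LRScheduler` needs.

```python
from types import SimpleNamespace

class HalveEvery10(LRScheduler):
    """Toy subclass: halve the learning rate every 10 calls to step()."""
    def step(self):
        self.step_count += 1
        if self.step_count % 10 == 0:
            self.optimizer.lr *= 0.5

# Stand-in optimizer: anything with an `lr` attribute works.
opt = SimpleNamespace(lr=0.1)
sched = HalveEvery10(opt)
for _ in range(30):
    sched.step()
print(f"lr after 30 steps: {sched.get_lr():.4f}")  # 0.0125 (three halvings)
```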
## 11.3 Step Decay

Reduce the lr by a factor every N epochs:
```python
class StepLR(LRScheduler):
    """Decay lr by gamma every step_size epochs."""

    def __init__(self, optimizer, step_size, gamma=0.1):
        super().__init__(optimizer)
        self.step_size = step_size
        self.gamma = gamma

    def step(self):
        self.step_count += 1
        if self.step_count % self.step_size == 0:
            self.optimizer.lr *= self.gamma
```

Usage:
```python
optimizer = Adam([w, b], lr=0.1)
scheduler = StepLR(optimizer, step_size=100, gamma=0.5)

for epoch in range(300):
    # ... training ...
    scheduler.step()  # lr: 0.1 → 0.05 → 0.025 → ...
```
## 11.4 Exponential Decay

Smooth decay every step:
```python
class ExponentialLR(LRScheduler):
    """Decay lr by gamma every epoch."""

    def __init__(self, optimizer, gamma=0.99):
        super().__init__(optimizer)
        self.gamma = gamma

    def step(self):
        self.step_count += 1
        self.optimizer.lr = self.base_lr * (self.gamma ** self.step_count)
```
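A quick usage sketch, assuming the `ExponentialLR` class above is in scope. The stand-in optimizer is just a plain object with an `lr` attribute, not a tensorweaver optimizer:

```python
from types import SimpleNamespace

opt = SimpleNamespace(lr=0.1)
sched = ExponentialLR(opt, gamma=0.99)

for epoch in range(1, 101):
    sched.step()
    if epoch in (10, 50, 100):
        print(f"epoch {epoch:3d}: lr = {opt.lr:.4f}")
# epoch  10: lr = 0.0904
# epoch  50: lr = 0.0605
# epoch 100: lr = 0.0366   (0.1 * 0.99**100)
```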
## 11.5 Cosine Annealing

Smooth decay following a cosine curve:

\[\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T_{\max}}\pi\right)\right)\]
```python
import numpy as np

class CosineAnnealingLR(LRScheduler):
    """Cosine annealing schedule."""

    def __init__(self, optimizer, T_max, eta_min=0):
        super().__init__(optimizer)
        self.T_max = T_max
        self.eta_min = eta_min

    def step(self):
        self.step_count += 1
        progress = self.step_count / self.T_max
        self.optimizer.lr = self.eta_min + (self.base_lr - self.eta_min) * \
            (1 + np.cos(np.pi * progress)) / 2
```

Cosine annealing is popular for training Transformers.
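A quick sanity check of the schedule shape, again using a plain stand-in object with an `lr` attribute rather than a real optimizer (the values assume `lr=0.1`, `T_max=100`, and the default `eta_min=0`):

```python
from types import SimpleNamespace

opt = SimpleNamespace(lr=0.1)
sched = CosineAnnealingLR(opt, T_max=100)

for t in range(1, 101):
    sched.step()
    if t in (25, 50, 75, 100):
        print(f"step {t:3d}: lr = {opt.lr:.4f}")
# step  25: lr = 0.0854
# step  50: lr = 0.0500   (halfway: cos(pi/2) = 0)
# step  75: lr = 0.0146
# step 100: lr = 0.0000   (eta_min)
```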
## 11.6 Warmup

Start with a tiny lr and gradually increase it:
```python
class WarmupLR(LRScheduler):
    """Linear warmup for first N steps."""

    def __init__(self, optimizer, warmup_steps):
        super().__init__(optimizer)
        self.warmup_steps = warmup_steps
        self.target_lr = optimizer.lr
        self.optimizer.lr = 0  # Start from 0

    def step(self):
        self.step_count += 1
        if self.step_count <= self.warmup_steps:
            # Linear warmup
            self.optimizer.lr = self.target_lr * (self.step_count / self.warmup_steps)
```

Why warmup?

- Adam's moment estimates are biased early on
- Warmup lets them stabilize before taking big steps
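A small sketch of the ramp, using a stand-in object with an `lr` attribute; `warmup_steps=4` and `lr=0.001` are arbitrary values chosen to keep the printout short:

```python
from types import SimpleNamespace

opt = SimpleNamespace(lr=0.001)
sched = WarmupLR(opt, warmup_steps=4)

for t in range(1, 7):
    sched.step()
    print(f"step {t}: lr = {opt.lr:.5f}")
# step 1: lr = 0.00025
# step 2: lr = 0.00050
# step 3: lr = 0.00075
# step 4: lr = 0.00100
# step 5: lr = 0.00100   (warmup finished; lr stays at the target)
# step 6: lr = 0.00100
```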
## 11.7 Warmup + Cosine Decay
The standard schedule for Transformers:
```python
class WarmupCosineScheduler(LRScheduler):
    """Warmup then cosine decay."""

    def __init__(self, optimizer, warmup_steps, total_steps, eta_min=0):
        super().__init__(optimizer)
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.eta_min = eta_min

    def step(self):
        self.step_count += 1
        if self.step_count <= self.warmup_steps:
            # Linear warmup
            self.optimizer.lr = self.base_lr * (self.step_count / self.warmup_steps)
        else:
            # Cosine decay
            progress = (self.step_count - self.warmup_steps) / \
                (self.total_steps - self.warmup_steps)
            self.optimizer.lr = self.eta_min + (self.base_lr - self.eta_min) * \
                (1 + np.cos(np.pi * progress)) / 2
```
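A quick trace of the combined schedule, again with a stand-in `lr`-only object; `warmup_steps=10` and `total_steps=100` are illustrative values:

```python
from types import SimpleNamespace

opt = SimpleNamespace(lr=0.1)
sched = WarmupCosineScheduler(opt, warmup_steps=10, total_steps=100)

for t in range(1, 101):
    sched.step()
    if t in (5, 10, 55, 100):
        print(f"step {t:3d}: lr = {opt.lr:.4f}")
# step   5: lr = 0.0500   (halfway through warmup)
# step  10: lr = 0.1000   (warmup complete, peak lr)
# step  55: lr = 0.0500   (halfway through the cosine decay)
# step 100: lr = 0.0000   (eta_min)
```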
## 11.8 Temperature Training with Schedule

```python
from tensorweaver import Tensor
from tensorweaver.optim import Adam
from tensorweaver.optim.lr_scheduler import CosineAnnealingLR

# Data
celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])

# Parameters
w = Tensor([[1.0]], requires_grad=True)
b = Tensor([0.0], requires_grad=True)

# Optimizer + Scheduler
optimizer = Adam([w, b], lr=0.5)
scheduler = CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    pred = celsius @ w.T + b
    loss = ((pred - fahrenheit) ** 2).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # Update learning rate

    if epoch % 40 == 0:
        print(f"Epoch {epoch}: loss={loss.data:.4f}, lr={optimizer.lr:.4f}")

print(f"Final: w={w.data.item():.3f}, b={b.data.item():.3f}")
```

Output:
```
Epoch 0: loss=5765.0000, lr=0.4961
Epoch 40: loss=0.5123, lr=0.3536
Epoch 80: loss=0.0089, lr=0.1545
Epoch 120: loss=0.0001, lr=0.0245
Epoch 160: loss=0.0000, lr=0.0015
Final: w=1.800, b=32.000
```
## 11.9 Comparing Schedules
| Schedule | Best For |
|---|---|
| Constant | Simple problems |
| StepLR | When you know good decay points |
| ExponentialLR | Smooth decay |
| CosineAnnealing | General purpose, Transformers |
| Warmup + Cosine | Large models, Transformers |
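To make the comparison concrete, here is a small sketch that steps each scheduler from this chapter for 100 steps and prints a few snapshots. The stand-in optimizers are plain objects with an `lr` attribute, and the hyperparameters are arbitrary illustrative choices:

```python
from types import SimpleNamespace

schedules = {
    "StepLR":          StepLR(SimpleNamespace(lr=0.1), step_size=30, gamma=0.1),
    "ExponentialLR":   ExponentialLR(SimpleNamespace(lr=0.1), gamma=0.95),
    "CosineAnnealing": CosineAnnealingLR(SimpleNamespace(lr=0.1), T_max=100),
    "Warmup+Cosine":   WarmupCosineScheduler(SimpleNamespace(lr=0.1),
                                             warmup_steps=10, total_steps=100),
}

for t in range(1, 101):
    for sched in schedules.values():
        sched.step()
    if t in (10, 50, 100):
        lrs = ", ".join(f"{name}={s.get_lr():.4f}" for name, s in schedules.items())
        print(f"step {t:3d}: {lrs}")
# At step 50, for example: StepLR=0.0100, ExponentialLR=0.0077,
# CosineAnnealing=0.0500, Warmup+Cosine=0.0587
```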
## 11.10 Part III Complete!
Milestone: You’ve built a complete training system!
- ✓ Optimizer base class
- ✓ SGD with momentum
- ✓ Adam optimizer
- ✓ Learning rate schedules
Training is now fast and stable.
## 11.11 Summary
- LR schedules adapt learning rate during training
- StepLR: Discrete drops
- CosineAnnealing: Smooth decay
- Warmup: Start slow for stability
- Standard combo: Warmup + Cosine
Next: Building deeper networks with activation functions and regularization.