11  Learning Rate Schedules

A single constant learning rate is rarely optimal for an entire run: the step size that makes quick progress early in training is usually too large near the end. Learning rate schedules adjust it as training progresses.

11.1 The Intuition

  • Early training: large steps to make quick progress
  • Late training: small steps for fine-tuning

flowchart LR
    A[Start: lr=0.1] --> B[Middle: lr=0.01] --> C[End: lr=0.001]
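
Before wrapping this in scheduler classes, the idea fits in a plain function of training progress. The sketch below is purely illustrative; the phase boundaries and the three values just mirror the diagram above.

def manual_lr(epoch, total_epochs=300):
    """Hand-picked learning rate for each phase of training (illustrative values)."""
    progress = epoch / total_epochs
    if progress < 1/3:        # early: large steps for quick progress
        return 0.1
    elif progress < 2/3:      # middle: medium steps
        return 0.01
    else:                     # late: small steps for fine-tuning
        return 0.001

print([manual_lr(e) for e in (0, 150, 299)])   # [0.1, 0.01, 0.001]

The scheduler classes in the rest of this chapter automate exactly this kind of adjustment and tie it to an optimizer.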

11.2 Scheduler Base Class

class LRScheduler:
    """Base class for learning rate schedulers."""

    def __init__(self, optimizer):
        self.optimizer = optimizer
        self.base_lr = optimizer.lr
        self.step_count = 0

    def step(self):
        """Update learning rate (call after each epoch or step)."""
        raise NotImplementedError

    def get_lr(self):
        """Get current learning rate."""
        return self.optimizer.lr

11.3 Step Decay

Reduce lr by a factor every N epochs:

class StepLR(LRScheduler):
    """Decay lr by gamma every step_size epochs."""

    def __init__(self, optimizer, step_size, gamma=0.1):
        super().__init__(optimizer)
        self.step_size = step_size
        self.gamma = gamma

    def step(self):
        self.step_count += 1
        if self.step_count % self.step_size == 0:
            self.optimizer.lr *= self.gamma

Usage:

optimizer = Adam([w, b], lr=0.1)
scheduler = StepLR(optimizer, step_size=100, gamma=0.5)

for epoch in range(300):
    # ... training ...
    scheduler.step()  # lr: 0.1 → 0.05 → 0.025 → ...
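
To watch the decay in isolation, the scheduler can be driven against a bare stand-in object that exposes only the lr attribute it touches. The _FakeOptimizer class below is hypothetical, used purely for illustration.

class _FakeOptimizer:
    """Minimal stand-in: schedulers only read and write optimizer.lr."""
    def __init__(self, lr):
        self.lr = lr

opt = _FakeOptimizer(lr=0.1)
sched = StepLR(opt, step_size=100, gamma=0.5)

for epoch in range(1, 301):
    sched.step()
    if epoch % 100 == 0:
        print(f"epoch {epoch}: lr={opt.lr}")
# epoch 100: lr=0.05
# epoch 200: lr=0.025
# epoch 300: lr=0.0125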

11.4 Exponential Decay

Multiply the learning rate by gamma after every epoch, giving a smooth exponential decay:

class ExponentialLR(LRScheduler):
    """Decay lr by gamma every epoch."""

    def __init__(self, optimizer, gamma=0.99):
        super().__init__(optimizer)
        self.gamma = gamma

    def step(self):
        self.step_count += 1
        self.optimizer.lr = self.base_lr * (self.gamma ** self.step_count)
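
To get a feel for how quickly this shrinks the learning rate, here is the closed-form value base_lr * gamma**t for a few epoch counts (illustrative numbers, computed directly rather than through the class):

base_lr, gamma = 0.1, 0.99
for t in (10, 100, 500):
    print(f"epoch {t}: lr={base_lr * gamma ** t:.5f}")
# epoch 10:  lr=0.09044  (about 90% of the base lr)
# epoch 100: lr=0.03660  (about 37%)
# epoch 500: lr=0.00066  (under 1%)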

11.5 Cosine Annealing

Smooth decay following a cosine curve:

\[\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T_{\max}}\pi\right)\right)\]

import numpy as np

class CosineAnnealingLR(LRScheduler):
    """Cosine annealing schedule."""

    def __init__(self, optimizer, T_max, eta_min=0):
        super().__init__(optimizer)
        self.T_max = T_max
        self.eta_min = eta_min

    def step(self):
        self.step_count += 1
        progress = self.step_count / self.T_max
        self.optimizer.lr = self.eta_min + (self.base_lr - self.eta_min) * \
                            (1 + np.cos(np.pi * progress)) / 2

Cosine annealing is popular for training Transformers.
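
A quick trace (reusing the hypothetical _FakeOptimizer stand-in from the StepLR sketch) shows the shape: the learning rate stays near the base value early, hits half of it at the midpoint, and reaches eta_min at T_max.

opt = _FakeOptimizer(lr=0.5)               # stand-in defined in the StepLR sketch
sched = CosineAnnealingLR(opt, T_max=200)

for t in range(1, 201):
    sched.step()
    if t in (1, 100, 200):
        print(f"step {t}: lr={opt.lr:.4f}")
# step 1:   lr=0.5000  (barely below the base lr)
# step 100: lr=0.2500  (halfway: half the base lr)
# step 200: lr=0.0000  (reaches eta_min)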

11.6 Warmup

Start with a tiny learning rate and increase it linearly to the target value:

class WarmupLR(LRScheduler):
    """Linear warmup for first N steps."""

    def __init__(self, optimizer, warmup_steps):
        super().__init__(optimizer)
        self.warmup_steps = warmup_steps
        self.target_lr = optimizer.lr
        self.optimizer.lr = 0  # Start from 0

    def step(self):
        self.step_count += 1
        if self.step_count <= self.warmup_steps:
            # Linear warmup
            self.optimizer.lr = self.target_lr * (self.step_count / self.warmup_steps)

Why warmup?

  • Adam's moment estimates are biased early in training
  • Warmup lets them stabilize before the optimizer takes large steps
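
A short trace of WarmupLR (with the hypothetical _FakeOptimizer stand-in and arbitrary numbers) shows the linear ramp:

opt = _FakeOptimizer(lr=0.5)            # stand-in defined in the StepLR sketch
sched = WarmupLR(opt, warmup_steps=4)   # __init__ resets lr to 0

for t in range(1, 7):
    sched.step()
    print(f"step {t}: lr={opt.lr:.3f}")
# steps 1-4 ramp linearly: 0.125, 0.250, 0.375, 0.500
# steps 5-6 stay at 0.500 (this scheduler only handles the warmup phase)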

11.7 Warmup + Cosine Decay

The standard schedule for Transformers:

class WarmupCosineScheduler(LRScheduler):
    """Warmup then cosine decay."""

    def __init__(self, optimizer, warmup_steps, total_steps, eta_min=0):
        super().__init__(optimizer)
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.eta_min = eta_min

    def step(self):
        self.step_count += 1

        if self.step_count <= self.warmup_steps:
            # Linear warmup
            self.optimizer.lr = self.base_lr * (self.step_count / self.warmup_steps)
        else:
            # Cosine decay
            progress = (self.step_count - self.warmup_steps) / \
                       (self.total_steps - self.warmup_steps)
            self.optimizer.lr = self.eta_min + (self.base_lr - self.eta_min) * \
                                (1 + np.cos(np.pi * progress)) / 2
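
Tracing a small configuration (hypothetical _FakeOptimizer stand-in, made-up step counts) shows both phases: a linear climb during warmup, then a cosine descent toward eta_min.

opt = _FakeOptimizer(lr=0.3)            # stand-in defined in the StepLR sketch
sched = WarmupCosineScheduler(opt, warmup_steps=10, total_steps=100)

for t in range(1, 101):
    sched.step()
    if t in (5, 10, 55, 100):
        print(f"step {t}: lr={opt.lr:.4f}")
# step 5:   lr=0.1500  (halfway through warmup: 0.3 * 5/10)
# step 10:  lr=0.3000  (warmup done, at the base lr)
# step 55:  lr=0.1500  (halfway through the cosine phase)
# step 100: lr=0.0000  (decayed to eta_min)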

11.8 Temperature Training with Schedule

from tensorweaver import Tensor
from tensorweaver.optim import Adam
from tensorweaver.optim.lr_scheduler import CosineAnnealingLR

# Data
celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])

# Parameters
w = Tensor([[1.0]], requires_grad=True)
b = Tensor([0.0], requires_grad=True)

# Optimizer + Scheduler
optimizer = Adam([w, b], lr=0.5)
scheduler = CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    pred = celsius @ w.T + b
    loss = ((pred - fahrenheit) ** 2).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    scheduler.step()  # Update learning rate

    if epoch % 40 == 0:
        print(f"Epoch {epoch}: loss={loss.data:.4f}, lr={optimizer.lr:.4f}")

print(f"Final: w={w.data.item():.3f}, b={b.data.item():.3f}")

Output:

Epoch 0: loss=5765.0000, lr=0.4961
Epoch 40: loss=0.5123, lr=0.3536
Epoch 80: loss=0.0089, lr=0.1545
Epoch 120: loss=0.0001, lr=0.0245
Epoch 160: loss=0.0000, lr=0.0015
Final: w=1.800, b=32.000

11.9 Comparing Schedules

Schedule          Best For
Constant          Simple problems
StepLR            When you know good decay points
ExponentialLR     Smooth decay
CosineAnnealing   General purpose, Transformers
Warmup + Cosine   Large models, Transformers

11.10 Part III Complete!

Tip

Milestone: You’ve built a complete training system!

  • ✓ Optimizer base class
  • ✓ SGD with momentum
  • ✓ Adam optimizer
  • ✓ Learning rate schedules

Training is now fast and stable.

11.11 Summary

  • LR schedules adapt learning rate during training
  • StepLR: Discrete drops
  • CosineAnnealing: Smooth decay
  • Warmup: Start slow for stability
  • Standard combo: Warmup + Cosine

Next: Building deeper networks with activation functions and regularization.