8 Optimizer Design

In Part II, we updated parameters manually. Let’s build proper optimizers.

The pattern we are building toward, as a data-flow diagram:

flowchart LR
    Model --> |parameters| Optimizer
    Loss --> |backward| Gradients
    Gradients --> Optimizer
    Optimizer --> |step| Model
8.1 The Problem with Manual Updates
Our Part II training loop:
w.data -= lr * w.grad
b.data -= lr * b.grad
w.grad = None
b.grad = None

Issues:
- Repetitive — Same code for every parameter
- Error-prone — Easy to forget a parameter
- Inflexible — Hard to add momentum, adaptive rates, etc.
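To make the first two points concrete, here is an illustrative sketch (the names w1, b1, w2, b2 are hypothetical, not from the Part II code) of what the same loop looks like once a model has two layers:

# Two layers already double the boilerplate, and forgetting any one of
# these lines silently breaks training.
w1.data -= lr * w1.grad
b1.data -= lr * b1.grad
w2.data -= lr * w2.grad
b2.data -= lr * b2.grad
w1.grad = None
b1.grad = None
w2.grad = None
b2.grad = None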
8.2 The Optimizer Abstraction
An optimizer manages all parameters and their updates:
class Optimizer:
    """Base class for all optimizers."""

    def __init__(self, parameters, lr=0.01):
        """
        Args:
            parameters: List of tensors to optimize
            lr: Learning rate
        """
        self.parameters = list(parameters)
        self.lr = lr

    def zero_grad(self):
        """Reset all gradients to None."""
        for param in self.parameters:
            param.grad = None

    def step(self):
        """Update parameters using gradients."""
        raise NotImplementedError
Note
Code Reference: See src/tensorweaver/optimizers/ for all optimizer implementations.
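The contract is deliberately small: subclasses only have to implement step(). As a quick illustration (NoOpOptimizer is a hypothetical name used here for demonstration, and w and b are assumed to be tensors created with requires_grad=True as in Part II), even a do-nothing subclass is a valid optimizer, and zero_grad() already works on it:

class NoOpOptimizer(Optimizer):
    """A valid but useless optimizer: it inherits zero_grad() and the
    parameter bookkeeping, and only supplies a step() of its own."""
    def step(self):
        pass  # real optimizers put their update rule here

opt = NoOpOptimizer([w, b], lr=0.1)
opt.zero_grad()   # every param.grad is now None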
8.3 Using an Optimizer
The training loop becomes cleaner:
# Before (manual)
for epoch in range(epochs):
    pred = model(x)
    loss = loss_fn(pred, y)
    loss.backward()
    w.data -= lr * w.grad
    b.data -= lr * b.grad
    w.grad = None
    b.grad = None
# After (with optimizer)
optimizer = SGD([w, b], lr=0.01)
for epoch in range(epochs):
    pred = model(x)
    loss = loss_fn(pred, y)
    loss.backward()
    optimizer.step()        # Update all parameters
    optimizer.zero_grad()   # Reset all gradients

Much cleaner!
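Because the loop no longer names individual parameters, it can even be factored into a generic helper. A minimal sketch, assuming the model is callable and loss_fn returns a scalar tensor (the train function below is illustrative, not part of tensorweaver):

def train(model, loss_fn, optimizer, x, y, epochs):
    """Generic loop: nothing in here knows how many parameters exist."""
    for epoch in range(epochs):
        pred = model(x)
        loss = loss_fn(pred, y)
        loss.backward()         # fill .grad on every parameter
        optimizer.step()        # apply the update rule
        optimizer.zero_grad()   # reset for the next iteration
    return loss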
8.4 The Simplest Optimizer: Vanilla SGD
Stochastic Gradient Descent without bells and whistles:
class SGD(Optimizer):
    """Stochastic Gradient Descent optimizer."""

    def step(self):
        """Update parameters: p = p - lr * grad"""
        for param in self.parameters:
            if param.grad is not None:
                param.data -= self.lr * param.grad

Usage:
# Initialize
w = Tensor([[1.0]], requires_grad=True)
b = Tensor([0.0], requires_grad=True)
optimizer = SGD([w, b], lr=0.01)
# Training step
loss.backward()
optimizer.step()
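# At this point step() has nudged each parameter against its gradient.
# With hypothetical values: if w.grad were [[4.0]], lr=0.01 would move
# w from [[1.0]] to [[0.96]], i.e. 1.0 - 0.01 * 4.0.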
optimizer.zero_grad()

8.5 Temperature Model with SGD
from tensorweaver import Tensor
from tensorweaver.optim import SGD
# Data
celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])
# Parameters
w = Tensor([[1.0]], requires_grad=True)
b = Tensor([0.0], requires_grad=True)
# Optimizer
optimizer = SGD([w, b], lr=0.0001)
# Training
for epoch in range(1000):
    # Forward
    pred = celsius @ w.T + b
    loss = ((pred - fahrenheit) ** 2).mean()
    # Backward
    loss.backward()
    # Update
    optimizer.step()
    optimizer.zero_grad()
    if epoch % 200 == 0:
        print(f"Epoch {epoch}: loss={loss.data:.2f}")
print(f"Learned: w={w.data.item():.3f}, b={b.data.item():.3f}")8.6 Why step() Then zero_grad()?
Order matters:
# Correct order
loss.backward() # 1. Compute gradients
optimizer.step() # 2. Use gradients to update
optimizer.zero_grad() # 3. Clear for next iteration
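# (Calling zero_grad() at the top of the next iteration, before backward(),
#  works just as well here; what matters is that step() sees fresh gradients.)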
# Wrong order
loss.backward()
optimizer.zero_grad() # Oops! Cleared before using
optimizer.step()       # Gradients are None, so nothing gets updated!

8.7 Collecting Parameters
For models with many parameters, we need a parameters() method:
import numpy as np

class LinearModel:
    def __init__(self, in_features, out_features):
        self.w = Tensor(np.random.randn(in_features, out_features) * 0.01,
                        requires_grad=True)
        self.b = Tensor(np.zeros(out_features), requires_grad=True)

    def __call__(self, x):
        return x @ self.w + self.b

    def parameters(self):
        """Return all trainable parameters."""
        return [self.w, self.b]

# Usage
model = LinearModel(1, 1)
optimizer = SGD(model.parameters(), lr=0.01)

8.8 The Optimizer Pattern
- Model provides its parameters to the optimizer
- loss.backward() fills the gradients
- optimizer.step() uses the gradients to update the model
- Repeat (one full cycle is sketched below)
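A hedged end-to-end sketch of this cycle, reusing the LinearModel and SGD classes defined above (the TwoLayer class and the placeholder data x, y are illustrative, not part of tensorweaver):

class TwoLayer:
    """Illustrative composite model: parameters() gathers from both children."""
    def __init__(self):
        self.first = LinearModel(1, 4)
        self.second = LinearModel(4, 1)

    def __call__(self, x):
        return self.second(self.first(x))

    def parameters(self):
        return self.first.parameters() + self.second.parameters()

x = Tensor(np.random.randn(8, 1))   # placeholder inputs
y = Tensor(np.random.randn(8, 1))   # placeholder targets

model = TwoLayer()
optimizer = SGD(model.parameters(), lr=0.01)   # four tensors, one optimizer

for epoch in range(100):
    pred = model(x)                  # 1. model turns inputs into predictions
    loss = ((pred - y) ** 2).mean()
    loss.backward()                  # 2. backward fills every .grad
    optimizer.step()                 # 3. optimizer updates the model
    optimizer.zero_grad()            # 4. reset, then repeat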
8.9 Summary
- Optimizer encapsulates parameter update logic
- step() — Apply gradient updates
- zero_grad() — Reset gradients for next iteration
- Vanilla SGD: param -= lr * grad
Next: Adding momentum for faster, smoother training.