flowchart LR
subgraph S1["Too Small"]
A1[step] --> A2[step] --> A3[step] --> A4[...]
end
subgraph S2["Just Right"]
B1[step] --> B2[step] --> B3[minimum!]
end
subgraph S3["Too Large"]
C1[step] --> C2[overshoot!] --> C3[overshoot!]
end
7 Your First Training Loop
We have gradients. Now let’s use them to learn!
7.1 The Simplest Training Algorithm
Gradient descent in three lines:
loss.backward() # Compute gradients
w.data -= lr * w.grad # Update weights
b.data -= lr * b.grad # Update biasThat’s it. The gradient tells us which direction increases the loss, so we go the opposite direction.
7.2 Learning Rate
The learning rate (lr) controls step size:
| lr | Effect |
|---|---|
| Too small (0.00001) | Slow convergence |
| Just right (0.01) | Steady progress |
| Too large (1.0) | Oscillates, may diverge |
7.3 Complete Training Example
Let’s learn the temperature conversion formula:
import numpy as np
from tensorweaver import Tensor
# Data: Celsius to Fahrenheit
celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])
# Initialize parameters randomly
np.random.seed(42)
w = Tensor(np.random.randn(1, 1), requires_grad=True) # Should learn 1.8
b = Tensor(np.random.randn(1), requires_grad=True) # Should learn 32
print(f"Initial: w={w.data.item():.3f}, b={b.data.item():.3f}")
# Initial: w=0.497, b=-0.139
# Training hyperparameters
lr = 0.0001 # Learning rate
epochs = 1000
# Training loop
for epoch in range(epochs):
# Forward pass
pred = celsius @ w.T + b
# Compute loss
diff = pred - fahrenheit
loss = (diff ** 2).mean()
# Backward pass
loss.backward()
# Update parameters (gradient descent)
w.data -= lr * w.grad
b.data -= lr * b.grad
# Reset gradients for next iteration
w.grad = None
b.grad = None
# Print progress
if epoch % 200 == 0:
print(f"Epoch {epoch}: loss={loss.data:.4f}, w={w.data.item():.3f}, b={b.data.item():.3f}")
print(f"\nFinal: w={w.data.item():.3f}, b={b.data.item():.3f}")
print(f"Target: w=1.800, b=32.000")Output:
Initial: w=0.497, b=-0.139
Epoch 0: loss=8585.2324, w=0.497, b=-0.139
Epoch 200: loss=189.4721, w=1.632, b=17.284
Epoch 400: loss=23.8465, w=1.754, b=27.892
Epoch 600: loss=3.0012, w=1.786, b=30.896
Epoch 800: loss=0.3777, w=1.796, b=31.865
Final: w=1.799, b=31.957
Target: w=1.800, b=32.000
The model learned w ≈ 1.8 and b ≈ 32 from data!
7.4 Visualizing Learning
# Test on new data
test_celsius = Tensor([[37.0]]) # Body temperature
test_pred = test_celsius @ w.T + b
print(f"37°C = {test_pred.data.item():.1f}°F")
# 37°C = 98.6°F ✓7.5 The Training Loop Breakdown
flowchart TD
A[Start with random w, b] --> B[Forward: compute predictions]
B --> C[Loss: measure error]
C --> D[Backward: compute gradients]
D --> E[Update: w -= lr * grad]
E --> F{Converged?}
F -->|No| B
F -->|Yes| G[Done!]
- Forward pass: Compute predictions with current parameters
- Loss: Measure how wrong we are
- Backward pass: Compute gradients via backpropagation
- Update: Move parameters in direction that reduces loss
- Repeat: Until loss is small enough
7.6 Why Reset Gradients?
w.grad = None
b.grad = NoneGradients accumulate by default. Without resetting:
Epoch 0: grad = 100
Epoch 1: grad = 100 + 95 = 195 # Wrong!
Epoch 2: grad = 195 + 90 = 285 # Very wrong!
We want the gradient for this iteration only.
7.7 Putting It All Together
def train_temperature_model(epochs=1000, lr=0.0001):
# Data
celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])
# Parameters
w = Tensor([[1.0]], requires_grad=True)
b = Tensor([0.0], requires_grad=True)
for epoch in range(epochs):
# Forward
pred = celsius @ w.T + b
loss = ((pred - fahrenheit) ** 2).mean()
# Backward
loss.backward()
# Update
w.data -= lr * w.grad
b.data -= lr * b.grad
# Reset
w.grad = None
b.grad = None
return w, b
w, b = train_temperature_model()
print(f"Learned: F = C × {w.data.item():.3f} + {b.data.item():.3f}")
# Learned: F = C × 1.800 + 32.000
Tip
Part II Complete!
You’ve implemented the core learning algorithm:
- ✓ Loss functions (MSE)
- ✓ Computational graph
- ✓ Backpropagation
- ✓ Gradient descent training
Your model learned F = 1.8C + 32 from data!
7.8 Limitations of This Approach
Our “naive” training works, but has issues:
- Learning rate is critical — Too small = slow, too large = unstable
- No momentum — Gets stuck in flat regions
- Same lr for all parameters — Suboptimal
Part III introduces optimizers that solve these problems.
7.9 Summary
Training loop:
for epoch in range(epochs):
pred = forward(x) # 1. Compute predictions
loss = loss_fn(pred, y) # 2. Measure error
loss.backward() # 3. Compute gradients
param.data -= lr * param.grad # 4. Update parameters
param.grad = None # 5. Reset gradientsWe’ve gone from “forward only” to “learning”!
Next: Making training smarter with optimizers.