7  Your First Training Loop

We have gradients. Now let’s use them to learn!

7.1 The Simplest Training Algorithm

Gradient descent in three lines:

loss.backward()           # Compute gradients
w.data -= lr * w.grad     # Update weights
b.data -= lr * b.grad     # Update bias

That’s it. The gradient tells us which direction increases the loss, so we go the opposite direction.
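
One concrete step with toy numbers (not values from any model in this chapter) makes the sign convention clear:

# A positive gradient means the loss rises as w rises,
# so the update moves w down.
w, grad, lr = 2.0, 4.0, 0.1
w = w - lr * grad
print(w)  # 1.6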

7.2 Learning Rate

The learning rate (lr) controls step size:

Learning rate         Effect
Too small (0.00001)   Slow convergence
Just right (0.01)     Steady progress
Too large (1.0)       Oscillates, may diverge

flowchart LR
    subgraph S1["Too Small"]
        A1[step] --> A2[step] --> A3[step] --> A4[...]
    end
    subgraph S2["Just Right"]
        B1[step] --> B2[step] --> B3[minimum!]
    end
    subgraph S3["Too Large"]
        C1[step] --> C2[overshoot!] --> C3[overshoot!]
    end
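
You can feel all three regimes with a few lines of plain Python (no tensorweaver needed): minimize the toy loss (w - 3) ** 2, whose gradient is 2 * (w - 3), using different learning rates. The specific rates below are chosen for this toy problem and are illustrative only:

def run(lr, steps=20):
    w = 0.0                   # start away from the minimum at w = 3
    for _ in range(steps):
        grad = 2 * (w - 3)    # gradient of (w - 3) ** 2
        w -= lr * grad        # gradient descent step
    return w

print(run(0.001))  # barely moves toward 3: too small
print(run(0.1))    # ends very close to 3: just right
print(run(1.1))    # overshoots farther every step and diverges: too large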

7.3 Complete Training Example

Let’s learn the temperature conversion formula:

import numpy as np
from tensorweaver import Tensor

# Data: Celsius to Fahrenheit
celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])

# Initialize parameters randomly
np.random.seed(42)
w = Tensor(np.random.randn(1, 1), requires_grad=True)  # Should learn 1.8
b = Tensor(np.random.randn(1), requires_grad=True)     # Should learn 32

print(f"Initial: w={w.data.item():.3f}, b={b.data.item():.3f}")
# Initial: w=0.497, b=-0.139

# Training hyperparameters
lr = 0.0001  # Learning rate
epochs = 1000

# Training loop
for epoch in range(epochs):
    # Forward pass
    pred = celsius @ w.T + b

    # Compute loss
    diff = pred - fahrenheit
    loss = (diff ** 2).mean()

    # Backward pass
    loss.backward()

    # Update parameters (gradient descent)
    w.data -= lr * w.grad
    b.data -= lr * b.grad

    # Reset gradients for next iteration
    w.grad = None
    b.grad = None

    # Print progress
    if epoch % 200 == 0:
        print(f"Epoch {epoch}: loss={loss.data:.4f}, w={w.data.item():.3f}, b={b.data.item():.3f}")

print(f"\nFinal: w={w.data.item():.3f}, b={b.data.item():.3f}")
print(f"Target: w=1.800, b=32.000")

Output:

Initial: w=0.497, b=-0.139
Epoch 0: loss=8585.2324, w=0.497, b=-0.139
Epoch 200: loss=189.4721, w=1.632, b=17.284
Epoch 400: loss=23.8465, w=1.754, b=27.892
Epoch 600: loss=3.0012, w=1.786, b=30.896
Epoch 800: loss=0.3777, w=1.796, b=31.865

Final: w=1.799, b=31.957
Target: w=1.800, b=32.000

The model learned w ≈ 1.8 and b ≈ 32 from data!
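
As a sanity check, the same line can be recovered in closed form with an ordinary least-squares fit (plain NumPy, no gradient descent), because the data lies exactly on F = 1.8C + 32:

import numpy as np

c = np.array([0.0, 20.0, 40.0, 60.0, 80.0, 100.0])
f = np.array([32.0, 68.0, 104.0, 140.0, 176.0, 212.0])
slope, intercept = np.polyfit(c, f, 1)       # degree-1 (linear) least-squares fit
print(round(slope, 3), round(intercept, 3))  # 1.8 32.0, the line our loop approaches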

7.4 Visualizing Learning

First, sanity-check the trained model on a temperature it never saw during training; then plot the loss curve to watch learning happen (sketched after the check below).

# Test on new data
test_celsius = Tensor([[37.0]])  # Body temperature
test_pred = test_celsius @ w.T + b
print(f"37°C = {test_pred.data.item():.1f}°F")
# 37°C = 98.6°F ✓
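
To plot the loss curve, record the loss at every epoch. A minimal sketch, assuming matplotlib is installed (it is not part of tensorweaver), which repeats the training loop from 7.3 with one extra line that records the loss:

import numpy as np
import matplotlib.pyplot as plt
from tensorweaver import Tensor

celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])

np.random.seed(42)
w = Tensor(np.random.randn(1, 1), requires_grad=True)
b = Tensor(np.random.randn(1), requires_grad=True)

lr, epochs = 0.0001, 1000
losses = []                              # one loss value per epoch

for epoch in range(epochs):
    pred = celsius @ w.T + b
    loss = ((pred - fahrenheit) ** 2).mean()
    losses.append(float(loss.data))      # record before the update
    loss.backward()
    w.data -= lr * w.grad
    b.data -= lr * b.grad
    w.grad = None
    b.grad = None

plt.plot(losses)
plt.yscale("log")                        # the loss spans several orders of magnitude
plt.xlabel("epoch")
plt.ylabel("MSE loss")
plt.show()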

7.5 The Training Loop Breakdown

flowchart TD
    A[Start with random w, b] --> B[Forward: compute predictions]
    B --> C[Loss: measure error]
    C --> D[Backward: compute gradients]
    D --> E[Update: w -= lr * grad]
    E --> F{Converged?}
    F -->|No| B
    F -->|Yes| G[Done!]

  1. Forward pass: Compute predictions with current parameters
  2. Loss: Measure how wrong we are
  3. Backward pass: Compute gradients via backpropagation
  4. Update: Move parameters in direction that reduces loss
  5. Repeat: Until the loss is small enough (a simple convergence check is sketched below)
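
The "Converged?" decision in the diagram can be as simple as stopping once the loss drops below a threshold. A minimal sketch that reuses the tensors and hyperparameters from 7.3 (the 1e-3 threshold is an arbitrary choice, not something this chapter prescribes):

for epoch in range(epochs):
    pred = celsius @ w.T + b
    loss = ((pred - fahrenheit) ** 2).mean()
    loss.backward()
    w.data -= lr * w.grad
    b.data -= lr * b.grad
    w.grad = None
    b.grad = None
    if float(loss.data) < 1e-3:          # converged: error is small enough, stop early
        print(f"Converged at epoch {epoch}")
        break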

7.6 Why Reset Gradients?

w.grad = None
b.grad = None

Gradients accumulate by default. Without resetting:

Epoch 0: grad = 100
Epoch 1: grad = 100 + 95 = 195  # Wrong!
Epoch 2: grad = 195 + 90 = 285  # Very wrong!

We want the gradient for this iteration only.
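
You can watch the accumulation happen by running two identical forward/backward passes without clearing the gradient in between. A minimal sketch (the toy data below is made up purely for this demonstration):

from tensorweaver import Tensor

x = Tensor([[1.0], [2.0], [3.0]])
y = Tensor([[2.0], [4.0], [6.0]])
w = Tensor([[0.0]], requires_grad=True)

for i in range(2):
    loss = ((x @ w.T - y) ** 2).mean()
    loss.backward()
    print(f"after pass {i}: w.grad = {w.grad}")

# Because gradients accumulate, the second printed value is twice the first,
# which is exactly why the training loop sets w.grad = None every iteration.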

7.7 Putting It All Together

def train_temperature_model(epochs=1000, lr=0.0001):
    # Data
    celsius = Tensor([[0.0], [20.0], [40.0], [60.0], [80.0], [100.0]])
    fahrenheit = Tensor([[32.0], [68.0], [104.0], [140.0], [176.0], [212.0]])

    # Parameters
    w = Tensor([[1.0]], requires_grad=True)
    b = Tensor([0.0], requires_grad=True)

    for epoch in range(epochs):
        # Forward
        pred = celsius @ w.T + b
        loss = ((pred - fahrenheit) ** 2).mean()

        # Backward
        loss.backward()

        # Update
        w.data -= lr * w.grad
        b.data -= lr * b.grad

        # Reset
        w.grad = None
        b.grad = None

    return w, b

w, b = train_temperature_model()
print(f"Learned: F = C × {w.data.item():.3f} + {b.data.item():.3f}")
# Learned: F ≈ C × 1.800 + 32.000
Tip: Part II Complete!

You’ve implemented the core learning algorithm:

  1. ✓ Loss functions (MSE)
  2. ✓ Computational graph
  3. ✓ Backpropagation
  4. ✓ Gradient descent training

Your model learned F = 1.8C + 32 from data!

7.8 Limitations of This Approach

Our “naive” training works, but has issues:

  1. Learning rate is critical — Too small = slow, too large = unstable
  2. No momentum — Gets stuck in flat regions
  3. Same lr for all parameters — Suboptimal

Part III introduces optimizers that solve these problems.

7.9 Summary

Training loop:

for epoch in range(epochs):
    pred = forward(x)           # 1. Compute predictions
    loss = loss_fn(pred, y)     # 2. Measure error
    loss.backward()             # 3. Compute gradients
    param.data -= lr * param.grad  # 4. Update parameters
    param.grad = None           # 5. Reset gradients

We’ve gone from “forward only” to “learning”!

Next: Making training smarter with optimizers.