4 Loss Functions
How do we know if our model is wrong? Loss functions measure the error.
```mermaid
flowchart LR
    subgraph HL["High Loss"]
        A[w=0.5, b=10] --> L1[Loss: 11794]
    end
    subgraph LL["Low Loss"]
        B[w=1.8, b=32] --> L2[Loss: 0]
    end
```
4.1 The Problem
In Part I, we hardcoded w=1.8 and b=32:
```python
fahrenheit = celsius @ w.T + b  # Perfect predictions!
```
But what if we start with random values?
```python
w = Tensor([[0.5]])   # Wrong!
b = Tensor([10.0])    # Wrong!
fahrenheit = celsius @ w.T + b
print(fahrenheit)
# [[10.0], [20.0], [28.5], [60.0]]  # All wrong!
```
We need a way to measure how wrong we are.
4.2 Mean Squared Error (MSE)
The most common loss for regression:
\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
Where:
- \(y_i\) = true value (target)
- \(\hat{y}_i\) = predicted value
- \(n\) = number of samples
```python
def mse_loss(predictions, targets):
    """Mean Squared Error loss."""
    diff = predictions - targets
    squared = diff ** 2
    return squared.mean()
```
4.3 Why Squared?
Why not just use absolute difference?
| Loss | Formula | Gradient |
|---|---|---|
| Absolute | \(\|y - \hat{y}\|\) | ±1 (discontinuous at 0) |
| Squared | \((y - \hat{y})^2\) | \(2(y - \hat{y})\) (smooth) |
Squared error:
- Smooth gradient — Easier to optimize
- Penalizes large errors more — Outliers matter
- Differentiable everywhere — No discontinuities
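To make the difference concrete, here is a small standalone NumPy sketch (not part of TensorWeaver) comparing the two gradients for a few error values:
```python
import numpy as np

# Standalone sketch: compare the gradients of absolute and squared error.
errors = np.array([-10.0, -1.0, -0.1, 0.1, 1.0, 10.0])  # y - y_hat

abs_grad = np.sign(errors)   # gradient of |e|: always +/-1, jumps at 0
sq_grad = 2 * errors         # gradient of e**2: shrinks smoothly near 0

print(abs_grad)  # [-1. -1. -1.  1.  1.  1.]
print(sq_grad)   # [-20.  -2.  -0.2  0.2  2.  20.]
```
The squared-error gradient scales with the error, so large mistakes get pushed on harder and tiny mistakes produce tiny updates; the absolute-error gradient gives the same +/-1 signal no matter how wrong we are.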
4.4 Temperature Conversion Example
Let’s compute the loss for our wrong predictions:
```python
# True values (what we want)
celsius = Tensor([[0.0], [100.0]])
true_fahrenheit = Tensor([[32.0], [212.0]])

# Wrong model
w = Tensor([[0.5]])
b = Tensor([10.0])

# Predictions (what we got)
pred_fahrenheit = celsius @ w.T + b
print(f"Predictions: {pred_fahrenheit.data.flatten()}")
# [10.0, 60.0]

# Loss
loss = mse_loss(pred_fahrenheit, true_fahrenheit)
print(f"MSE Loss: {loss.item()}")
# MSE = ((32-10)² + (212-60)²) / 2
#     = (484 + 23104) / 2
#     = 11794.0
```
That’s a big loss! Let’s try better parameters:
```python
# Better model
w = Tensor([[1.8]])
b = Tensor([32.0])

pred_fahrenheit = celsius @ w.T + b
loss = mse_loss(pred_fahrenheit, true_fahrenheit)
print(f"MSE Loss: {loss.item()}")
# MSE = ((32-32)² + (212-212)²) / 2 = 0.0
```
Loss of 0 means perfect predictions!
4.5 Loss as a Compass
Think of loss as a compass pointing toward better parameters:
Our goal: Find parameters that minimize the loss.
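As a quick illustration (a plain-NumPy sketch, separate from the Tensor class), we can evaluate the loss for a few candidate parameter settings and let the numbers point us toward the better ones:
```python
import numpy as np

# Plain-NumPy sketch: the loss ranks candidate parameters for us.
celsius = np.array([[0.0], [100.0]])
targets = np.array([[32.0], [212.0]])

for w, b in [(0.5, 10.0), (1.0, 20.0), (1.8, 32.0)]:
    pred = celsius * w + b
    mse = ((pred - targets) ** 2).mean()
    print(f"w={w}, b={b} -> MSE={mse}")
# w=0.5, b=10.0 -> MSE=11794.0
# w=1.0, b=20.0 -> MSE=4304.0
# w=1.8, b=32.0 -> MSE=0.0
```
The loss alone does not tell us how to change w and b; it only scores each guess. Turning that score into a direction is the job of gradients, which the next chapters build.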
4.6 Implementing MSE Loss
Let’s add MSE to our Tensor class:
```python
def mse_loss(predictions, targets):
    """
    Mean Squared Error loss.

    Args:
        predictions: Model outputs, shape (n, ...)
        targets: True values, shape (n, ...)

    Returns:
        Scalar tensor with mean squared error
    """
    diff = predictions - targets
    squared = diff ** 2
    return squared.mean()
```
Code Reference: See src/tensorweaver/losses/mse.py for the implementation.
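A quick usage check (assuming the Tensor class from Part I), reproducing the number from Section 4.4:
```python
# Usage sketch: feed the wrong-model predictions from Section 4.4 into mse_loss.
preds = Tensor([[10.0], [60.0]])
targets = Tensor([[32.0], [212.0]])

loss = mse_loss(preds, targets)
print(loss.item())  # 11794.0, matching the hand computation
```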
4.7 Cross-Entropy Loss (Preview)
MSE works for regression (predicting numbers). For classification (predicting categories), we use Cross-Entropy Loss.
We’ll use this in Part IV for Iris flower classification, but here’s the intuition:
4.7.1 Softmax: Turning Scores into Probabilities
Neural networks output raw scores (logits). Softmax converts them to probabilities:
\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]
```python
import numpy as np

def softmax(x):
    """Convert logits to probabilities."""
    exp_x = np.exp(x - x.max(axis=-1, keepdims=True))  # Subtract the max for numerical stability
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

# Example: 3-class classification
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)  # [0.659, 0.242, 0.099] - sums to 1.0
```
4.7.2 Cross-Entropy: Measuring Classification Error
\[\text{CE} = -\sum_i y_i \log(\hat{y}_i)\]
Where \(y\) is the one-hot encoded target and \(\hat{y}\) is the predicted probability.
```python
def cross_entropy_loss(logits, targets):
    """
    Cross-entropy loss for classification.

    Args:
        logits: Raw scores, shape (batch, num_classes)
        targets: Class indices, shape (batch,)

    Returns:
        Scalar loss
    """
    # Softmax
    probs = softmax(logits)
    # Select probability of correct class
    batch_size = logits.shape[0]
    correct_probs = probs[np.arange(batch_size), targets]
    # Negative log probability
    loss = -np.log(correct_probs + 1e-8)  # Add epsilon for stability
    return loss.mean()
```
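A small usage sketch (the logits and labels here are made up for illustration, and it assumes the softmax function above):
```python
# Two samples, three classes; the correct classes are 0 and 1.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
targets = np.array([0, 1])

print(cross_entropy_loss(logits, targets))
# roughly 0.32: the mean of -log(0.659) and -log(0.803)
```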
4.7.3 Softmax Backward
The gradient of softmax combined with cross-entropy is elegantly simple:
\[\frac{\partial \text{CE}}{\partial x_i} = \hat{y}_i - y_i\]
```python
def softmax_cross_entropy_backward(probs, targets):
    """
    Gradient of cross-entropy loss w.r.t. logits.

    The beautiful result: gradient = predictions - targets
    """
    grad = probs.copy()
    batch_size = probs.shape[0]
    grad[np.arange(batch_size), targets] -= 1
    return grad / batch_size
```
This is why softmax + cross-entropy is so popular: the gradient is just predictions - targets!
Code Reference: See src/tensorweaver/losses/cross_entropy.py for the implementation.
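As a sanity check, here is a standalone sketch (it re-defines small softmax and cross-entropy helpers so it runs on its own) comparing the analytic gradient probs - one_hot with a finite-difference estimate:
```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def _ce(logits, targets):
    p = _softmax(logits)
    n = logits.shape[0]
    return -np.log(p[np.arange(n), targets] + 1e-8).mean()

logits = np.array([[2.0, 1.0, 0.1]])
targets = np.array([0])

# Analytic gradient: probs - one_hot (divided by the batch size of 1)
grad = _softmax(logits).copy()
grad[np.arange(1), targets] -= 1

# Finite-difference estimate for logit [0, 1]
eps = 1e-5
bumped = logits.copy()
bumped[0, 1] += eps
numeric = (_ce(bumped, targets) - _ce(logits, targets)) / eps

print(grad[0, 1], numeric)  # both approximately 0.242
```
Which loss to use depends on the task: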
| Task | Loss Function | Output Activation |
|---|---|---|
| Regression | MSE | None (linear) |
| Binary Classification | Binary Cross-Entropy | Sigmoid |
| Multi-class Classification | Cross-Entropy | Softmax |
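Binary cross-entropy does not appear elsewhere in this chapter, so here is a minimal sketch (plain NumPy, not TensorWeaver code) of the sigmoid + binary cross-entropy pairing from the table:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(logits, targets):
    """targets are 0/1 labels; logits are raw scores."""
    p = sigmoid(logits)
    eps = 1e-8  # avoid log(0)
    return -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps)).mean()

logits = np.array([2.0, -1.0, 0.5])
targets = np.array([1.0, 0.0, 1.0])
print(binary_cross_entropy(logits, targets))  # roughly 0.30
```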
4.8 The Loss Landscape
For our temperature model with 2 parameters (w, b), loss forms a surface:
```
Loss
 ^
 |   *               High loss region
 |  / \
 | /   \
 |/     \
 |   *   \___        Low loss region (goal!)
 +--------------> w, b
```
Training = finding the lowest point on this surface.
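We cannot plot the surface here, but a coarse grid search (a plain-NumPy sketch with made-up grid values) shows the same shape numerically:
```python
import numpy as np

# Sample the MSE surface on a coarse (w, b) grid.
celsius = np.array([[0.0], [100.0]])
targets = np.array([[32.0], [212.0]])

for w in (0.5, 1.0, 1.8, 2.5):
    for b in (0.0, 32.0, 64.0):
        pred = celsius * w + b
        mse = ((pred - targets) ** 2).mean()
        print(f"w={w:3}, b={b:4}: MSE={mse:10.1f}")
# The smallest value on this grid, MSE=0.0, sits at w=1.8, b=32.0.
```
Gradient descent will find that lowest point without trying every combination, which is what the next chapters build toward.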
4.9 Breaking Down the Loss Computation
Let’s trace through step by step:
```python
celsius = Tensor([[0.0], [100.0]])     # (2, 1)
targets = Tensor([[32.0], [212.0]])    # (2, 1)
w = Tensor([[0.5]])                    # (1, 1)
b = Tensor([10.0])                     # (1,)

# Step 1: Forward pass
pred = celsius @ w.T + b
# pred = [[0*0.5+10], [100*0.5+10]] = [[10], [60]]

# Step 2: Compute difference
diff = pred - targets
# diff = [[10-32], [60-212]] = [[-22], [-152]]

# Step 3: Square the difference
squared = diff ** 2
# squared = [[484], [23104]]

# Step 4: Take the mean
loss = squared.mean()
# loss = (484 + 23104) / 2 = 11794
```
4.10 What’s Next
We know our model is wrong (loss = 11794). But how do we improve it?
We need to answer: Which direction should we change w and b to reduce the loss?
This requires computing gradients — how the loss changes with respect to each parameter.
That’s what backpropagation does, and it’s the topic of the next chapters.
4.11 Summary
- Loss functions measure prediction error
- MSE = mean of squared differences
- Lower loss = better predictions
- Loss of 0 = perfect predictions
- Training = minimizing the loss
Next: Building the computational graph that enables backpropagation.