4  Loss Functions

How do we know how wrong our model is? Loss functions put a single number on the error.

4.1 The Problem

In Part I, we hardcoded w=1.8 and b=32:

fahrenheit = celsius @ w.T + b  # Perfect predictions!

But what if we start with random values?

w = Tensor([[0.5]])   # Wrong!
b = Tensor([10.0])    # Wrong!

fahrenheit = celsius @ w.T + b
print(fahrenheit)
# [[10.0], [20.0], [28.5], [60.0]]  # All wrong!

We need a way to measure how wrong we are.

4.2 Mean Squared Error (MSE)

The most common loss for regression:

\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]

Where:

  • \(y_i\) = true value (target)
  • \(\hat{y}_i\) = predicted value
  • \(n\) = number of samples

def mse_loss(predictions, targets):
    """Mean Squared Error loss."""
    diff = predictions - targets
    squared = diff ** 2
    return squared.mean()
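
To double-check the arithmetic without any Tensor machinery, here is the same formula in plain NumPy (the numbers are made up for illustration):

import numpy as np

# Hand-check of the MSE formula on two made-up samples
targets = np.array([32.0, 212.0])
predictions = np.array([30.0, 210.0])

mse = np.mean((predictions - targets) ** 2)
print(mse)  # ((30-32)² + (210-212)²) / 2 = 4.0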

4.3 Why Squared?

Why not just use absolute difference?

| Loss     | Formula                       | Gradient (w.r.t. \(\hat{y}\))  |
|----------|-------------------------------|--------------------------------|
| Absolute | \(\lvert y - \hat{y} \rvert\) | \(\pm 1\) (discontinuous at 0) |
| Squared  | \((y - \hat{y})^2\)           | \(2(\hat{y} - y)\) (smooth)    |

Squared error:

  1. Smooth gradient — Easier to optimize
  2. Penalizes large errors more — Outliers matter
  3. Differentiable everywhere — No discontinuities
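
To make the first two points concrete, here is a small NumPy sketch (illustrative only) comparing the two gradients with respect to the prediction:

import numpy as np

errors = np.array([-2.0, -0.5, 0.5, 2.0])   # y - ŷ for four hypothetical samples

# Absolute error: the gradient w.r.t. the prediction is just a sign, it never shrinks
grad_abs = -np.sign(errors)
print(grad_abs)      # [ 1.  1. -1. -1.]

# Squared error: the gradient scales with the error, so large mistakes push harder
grad_squared = -2 * errors
print(grad_squared)  # [ 4.  1. -1. -4.]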

4.4 Temperature Conversion Example

Let’s compute the loss for our wrong predictions:

# True values (what we want)
celsius = Tensor([[0.0], [100.0]])
true_fahrenheit = Tensor([[32.0], [212.0]])

# Wrong model
w = Tensor([[0.5]])
b = Tensor([10.0])

# Predictions (what we got)
pred_fahrenheit = celsius @ w.T + b
print(f"Predictions: {pred_fahrenheit.data.flatten()}")
# [10.0, 60.0]

# Loss
loss = mse_loss(pred_fahrenheit, true_fahrenheit)
print(f"MSE Loss: {loss.item()}")
# MSE = ((32-10)² + (212-60)²) / 2
#     = (484 + 23104) / 2
#     = 11794.0

That’s a big loss! Let’s try better parameters:

# Better model
w = Tensor([[1.8]])
b = Tensor([32.0])

pred_fahrenheit = celsius @ w.T + b
loss = mse_loss(pred_fahrenheit, true_fahrenheit)
print(f"MSE Loss: {loss.item()}")
# MSE = ((32-32)² + (212-212)²) / 2 = 0.0

Loss of 0 means perfect predictions!

4.5 Loss as a Compass

Think of loss as a compass pointing toward better parameters:

flowchart LR
    subgraph HL["High Loss"]
        A[w=0.5, b=10] --> L1[Loss: 11794]
    end
    subgraph LL["Low Loss"]
        B[w=1.8, b=32] --> L2[Loss: 0]
    end

Our goal: Find parameters that minimize the loss.

4.6 Implementing MSE Loss

Let’s add MSE to our Tensor class:

def mse_loss(predictions, targets):
    """
    Mean Squared Error loss.

    Args:
        predictions: Model outputs, shape (n, ...)
        targets: True values, shape (n, ...)

    Returns:
        Scalar tensor with mean squared error
    """
    diff = predictions - targets
    squared = diff ** 2
    return squared.mean()

Note

Code Reference: See src/tensorweaver/losses/mse.py for the implementation.

4.7 Cross-Entropy Loss (Preview)

MSE works for regression (predicting numbers). For classification (predicting categories), we use Cross-Entropy Loss.

We’ll use this in Part IV for Iris flower classification, but here’s the intuition:

4.7.1 Softmax: Turning Scores into Probabilities

Neural networks output raw scores (logits). Softmax converts them to probabilities:

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

import numpy as np

def softmax(x):
    """Convert logits to probabilities."""
    exp_x = np.exp(x - x.max(axis=-1, keepdims=True))  # Subtract max for numerical stability
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

# Example: 3-class classification
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)  # [0.659, 0.242, 0.099] - sums to 1.0
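
The x.max subtraction inside softmax deserves a note: shifting every logit by the same constant leaves the output unchanged, but it keeps np.exp from overflowing on large scores. A quick check with the softmax defined above:

# Same logits shifted by +998: identical probabilities, no overflow
big_logits = np.array([1000.0, 999.0, 998.1])
print(softmax(big_logits))  # [0.659, 0.242, 0.099] - same as before

# Without the max trick, np.exp(1000.0) overflows to inf and the result becomes nan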

4.7.2 Cross-Entropy: Measuring Classification Error

\[\text{CE} = -\sum_i y_i \log(\hat{y}_i)\]

Where \(y\) is one-hot encoded target, \(\hat{y}\) is predicted probability.
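
For the three-class example above, suppose the true class is the first one, so \(y = [1, 0, 0]\). Only the correct class survives the sum, and with predicted probabilities \([0.659, 0.242, 0.099]\):

\[\text{CE} = -\log(0.659) \approx 0.417\]

A confident correct prediction drives the loss toward 0; putting low probability on the true class makes \(-\log\) blow up.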

def cross_entropy_loss(logits, targets):
    """
    Cross-entropy loss for classification.

    Args:
        logits: Raw scores, shape (batch, num_classes)
        targets: Class indices, shape (batch,)

    Returns:
        Scalar loss
    """
    # Softmax
    probs = softmax(logits)

    # Select probability of correct class
    batch_size = logits.shape[0]
    correct_probs = probs[np.arange(batch_size), targets]

    # Negative log probability
    loss = -np.log(correct_probs + 1e-8)  # Add epsilon for stability

    return loss.mean()
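
A quick usage sketch with a batch of two made-up samples (values chosen for illustration):

# Two samples, three classes; the true classes are 0 and 2
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])
targets = np.array([0, 2])

print(cross_entropy_loss(logits, targets))
# Sample 1: -log(0.659) ≈ 0.417
# Sample 2: -log(0.859) ≈ 0.152
# Mean loss ≈ 0.285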

4.7.3 Softmax Backward

The gradient of softmax combined with cross-entropy is elegantly simple:

\[\frac{\partial \text{CE}}{\partial x_i} = \hat{y}_i - y_i\]

def softmax_cross_entropy_backward(probs, targets):
    """
    Gradient of cross-entropy loss w.r.t. logits.

    The beautiful result: gradient = predictions - targets
    """
    grad = probs.copy()
    batch_size = probs.shape[0]
    grad[np.arange(batch_size), targets] -= 1
    return grad / batch_size

This is why softmax + cross-entropy is so popular: the gradient is just predictions - targets!
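
You can verify this numerically with a finite-difference check against the functions above (a rough sketch, not part of the library):

logits = np.array([[2.0, 1.0, 0.1]])
targets = np.array([0])

# Analytic gradient: (probabilities - one-hot targets) / batch_size
analytic = softmax_cross_entropy_backward(softmax(logits), targets)

# Numerical gradient: nudge each logit slightly and watch how the loss moves
eps = 1e-5
numerical = np.zeros_like(logits)
for i in range(logits.shape[1]):
    bumped = logits.copy()
    bumped[0, i] += eps
    numerical[0, i] = (cross_entropy_loss(bumped, targets)
                       - cross_entropy_loss(logits, targets)) / eps

print(analytic)   # ≈ [[-0.341  0.242  0.099]]
print(numerical)  # approximately the same values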

Note

Code Reference: See src/tensorweaver/losses/cross_entropy.py for the implementation.

Which loss pairs with which task:

| Task | Loss Function | Output Activation |
|------|---------------|-------------------|
| Regression | MSE | None (linear) |
| Binary Classification | Binary Cross-Entropy | Sigmoid |
| Multi-class Classification | Cross-Entropy | Softmax |

4.8 The Loss Landscape

For our temperature model with 2 parameters (w, b), loss forms a surface:

Loss
 ^
 |    *  High loss region
 |   / \
 |  /   \
 | /     \
 |/   *   \___  Low loss region (goal!)
 +-------------> w, b

Training = finding the lowest point on this surface.
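
One way to get a feel for this surface is to sweep a grid of candidate (w, b) values and evaluate the MSE at each point. A plain-NumPy sketch (grid ranges chosen arbitrarily):

import numpy as np

celsius = np.array([0.0, 100.0])
fahrenheit = np.array([32.0, 212.0])

# Evaluate the MSE for every (w, b) pair on a coarse grid
ws = np.linspace(0.0, 3.0, 61)     # candidate slopes
bs = np.linspace(0.0, 60.0, 61)    # candidate intercepts
losses = np.array([[np.mean((w * celsius + b - fahrenheit) ** 2) for b in bs]
                   for w in ws])

i, j = np.unravel_index(losses.argmin(), losses.shape)
print(ws[i], bs[j], losses[i, j])
# Approximately w=1.8, b=32.0 with loss near 0 - the grid's lowest point is our known solution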

4.9 Breaking Down the Loss Computation

Let’s trace through step by step:

celsius = Tensor([[0.0], [100.0]])        # (2, 1)
targets = Tensor([[32.0], [212.0]])       # (2, 1)

w = Tensor([[0.5]])                        # (1, 1)
b = Tensor([10.0])                         # (1,)

# Step 1: Forward pass
pred = celsius @ w.T + b
# pred = [[0*0.5+10], [100*0.5+10]] = [[10], [60]]

# Step 2: Compute difference
diff = pred - targets
# diff = [[10-32], [60-212]] = [[-22], [-152]]

# Step 3: Square the difference
squared = diff ** 2
# squared = [[484], [23104]]

# Step 4: Take the mean
loss = squared.mean()
# loss = (484 + 23104) / 2 = 11794

4.10 What’s Next

We know our model is wrong (loss = 11794). But how do we improve it?

We need to answer one question: in which direction should we change w and b to reduce the loss?

This requires computing gradients — how the loss changes with respect to each parameter.

That’s what backpropagation does, and it’s the topic of the next chapters.
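
As a preview, you can already estimate these gradients numerically by nudging each parameter and watching the loss change (finite differences; backpropagation will compute the same thing exactly and far more efficiently):

import numpy as np

celsius = np.array([0.0, 100.0])
fahrenheit = np.array([32.0, 212.0])

def loss_at(w, b):
    """MSE of the temperature model for a given (w, b)."""
    return np.mean((w * celsius + b - fahrenheit) ** 2)

w, b, eps = 0.5, 10.0, 1e-4
grad_w = (loss_at(w + eps, b) - loss_at(w, b)) / eps
grad_b = (loss_at(w, b + eps) - loss_at(w, b)) / eps
print(grad_w, grad_b)
# Roughly -15200 and -174: both gradients are negative, so increasing w and b reduces the loss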

4.11 Summary

  • Loss functions measure prediction error
  • MSE = mean of squared differences
  • Lower loss = better predictions
  • Loss of 0 = perfect predictions
  • Training = minimizing the loss

Next: Building the computational graph that enables backpropagation.