13  Regularization

Deep networks have enough capacity to memorize their training data outright. Regularization constrains training so the model generalizes to new data instead of overfitting.

13.1 The Overfitting Problem

flowchart LR
    subgraph UF["Underfitting"]
        A[High train error, High test error]
    end
    subgraph GF["Good Fit"]
        B[Low train error, Low test error]
    end
    subgraph OF["Overfitting"]
        C[Low train error, HIGH test error]
    end

Overfitting: Model memorizes training data but fails on new data.
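
In practice, overfitting shows up as a widening gap between training and validation loss over the epochs. A minimal illustration with made-up loss curves (the numbers are hypothetical, for intuition only):

train_losses = [0.90, 0.50, 0.30, 0.15, 0.08]   # keeps improving
val_losses   = [0.95, 0.60, 0.45, 0.50, 0.62]   # stalls, then gets worse

# A growing positive gap is the classic symptom of memorization
gaps = [v - t for t, v in zip(train_losses, val_losses)]
print(gaps)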

13.2 Dropout

Randomly “drop” neurons during training:

import numpy as np

from tensorweaver import Tensor

class Dropout:
    """Randomly zero out elements during training."""

    def __init__(self, p=0.5):
        """
        Args:
            p: Probability of dropping (0.5 = drop half)
        """
        self.p = p
        self.training = True

    def __call__(self, x):
        if not self.training or self.p == 0:
            return x

        # Create random mask
        mask = (np.random.rand(*x.shape) > self.p).astype(np.float32)

        # Scale by 1/(1-p) to maintain expected value
        scale = 1.0 / (1 - self.p)

        result_data = x.data * mask * scale
        result = Tensor(result_data, requires_grad=x.requires_grad)

        if x.requires_grad:
            result.grad_fn = 'dropout'
            result.parents = [x]
            result._dropout_mask = mask * scale

        return result
Note (Code Reference): See src/tensorweaver/layers/dropout.py for the implementation.

13.2.1 Why Dropout Works

  1. Prevents co-adaptation: Neurons can’t rely on specific other neurons
  2. Ensemble effect: Training many “sub-networks”
  3. Noise injection: Adds stochastic noise to the hidden activations, which discourages fitting spurious patterns (see the sketch below)
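
To make the noise-injection point concrete, here is a quick NumPy check (plain arrays, illustrative only, not TensorWeaver tensors) that each forward pass zeroes roughly a fraction p of the activations:

import numpy as np

p = 0.5
h = np.random.randn(10000)                 # stand-in hidden activations
mask = (np.random.rand(*h.shape) > p)      # keep each unit with probability 1 - p
print((mask == 0).mean())                  # ~0.5: about half the units are dropped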

13.2.2 Dropout: Training vs Inference

x = Tensor(np.random.randn(2, 8))  # example input (assumes Tensor and numpy are imported)
dropout = Dropout(p=0.5)

# Training: randomly drop
dropout.training = True
out = dropout(x)  # Some values become 0

# Inference: keep everything
dropout.training = False
out = dropout(x)  # All values preserved
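
The 1/(1 - p) scaling in the class above is what makes this switch safe: averaged over many random masks, the scaled training-time output matches the input, so nothing needs rescaling at inference. A minimal NumPy check of that claim (illustrative only):

import numpy as np

p = 0.5
scale = 1.0 / (1 - p)
inputs = np.ones(1000)

trials = 2000
acc = np.zeros_like(inputs)
for _ in range(trials):
    mask = (np.random.rand(*inputs.shape) > p).astype(np.float32)
    acc += inputs * mask * scale

print((acc / trials).mean())  # ~1.0: the expected value of the input is preserved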

13.3 Weight Decay (L2 Regularization)

Penalize large weights:

\[\mathcal{L}_{total} = \mathcal{L}_{data} + \lambda \sum_i w_i^2\]

This discourages the model from using very large weights.
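
Taking the gradient of the penalty shows why this is also called weight decay: each update pulls every weight toward zero in proportion to its magnitude (η is the learning rate).

\[\frac{\partial \mathcal{L}_{total}}{\partial w_i} = \frac{\partial \mathcal{L}_{data}}{\partial w_i} + 2 \lambda w_i
\quad \Rightarrow \quad
w_i \leftarrow w_i - \eta \left( \frac{\partial \mathcal{L}_{data}}{\partial w_i} + 2 \lambda w_i \right)\]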

In AdamW (Chapter 10), we implemented the decoupled variant, which shrinks each weight directly during the update instead of adding the penalty to the loss:

# Decoupled weight decay
param.data -= lr * weight_decay * param.data
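
For an optimizer without built-in decay, the same effect comes from folding the penalty's gradient into an ordinary gradient-descent update. A minimal NumPy sketch under that assumption (the helper name is illustrative, not part of TensorWeaver):

import numpy as np

def sgd_step_with_l2(w, grad, lr=0.01, weight_decay=1e-4):
    """The gradient of weight_decay * sum(w**2) is 2 * weight_decay * w."""
    return w - lr * (grad + 2 * weight_decay * w)

w = np.random.randn(8, 3)
grad = np.random.randn(8, 3)   # stand-in for the data-loss gradient
w = sgd_step_with_l2(w, grad)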

13.4 Early Stopping

Stop training when validation loss stops improving:

# train_one_epoch, evaluate, save_model and load_model are assumed to be defined elsewhere
max_epochs = 100
best_val_loss = float('inf')
patience = 10              # epochs to wait after the last improvement
patience_counter = 0

for epoch in range(max_epochs):
    # Training...
    train_loss = train_one_epoch()

    # Validation
    val_loss = evaluate(val_data)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        save_model()  # Save best model
    else:
        patience_counter += 1

    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch}")
        break

load_model()  # Restore best model

13.5 Data Augmentation

Create variations of training data:

def augment_tabular(x, noise_std=0.1):
    """Add small noise to tabular data."""
    noise = np.random.randn(*x.shape) * noise_std
    return x + noise

For images: rotations, flips, crops, color changes. For text: synonym replacement, back-translation.
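
As one concrete image example, a random horizontal flip takes only a few lines of NumPy; this is an illustrative sketch, not a TensorWeaver utility:

import numpy as np

def random_horizontal_flip(image, p=0.5):
    """Flip an (H, W, C) image array left-right with probability p."""
    if np.random.rand() < p:
        return image[:, ::-1, :]
    return image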

13.6 Regularization Summary

Technique         | What It Does                | When to Use
Dropout           | Randomly drops neurons      | Hidden layers
Weight Decay      | Penalizes large weights     | Always (small λ)
Early Stopping    | Stops before overfitting    | Always
Data Augmentation | Creates more training data  | When data is limited

13.7 Using Dropout in a Network

import numpy as np

from tensorweaver import Tensor
from tensorweaver.nn.functional import relu
from tensorweaver.layers import Dropout
from tensorweaver.optim import Adam

# Model with dropout
W1 = Tensor(np.random.randn(4, 8) * 0.5, requires_grad=True)
b1 = Tensor(np.zeros(8), requires_grad=True)
W2 = Tensor(np.random.randn(8, 3) * 0.5, requires_grad=True)
b2 = Tensor(np.zeros(3), requires_grad=True)

dropout = Dropout(p=0.3)

def forward(x, training=True):
    dropout.training = training

    h = relu(x @ W1 + b1)
    h = dropout(h)  # Apply dropout
    out = h @ W2 + b2
    return out

optimizer = Adam([W1, b1, W2, b2], lr=0.01)

# Training (x_train, y_train, compute_loss and epochs are assumed to be defined elsewhere)
for epoch in range(epochs):
    out = forward(x_train, training=True)  # Dropout ON
    loss = compute_loss(out, y_train)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Evaluation
out = forward(x_test, training=False)  # Dropout OFF (forward() sets dropout.training = False)

13.8 Common Dropout Values

Layer Type    | Dropout Rate
Input layer   | 0.0 - 0.2
Hidden layers | 0.3 - 0.5
Before output | 0.0 - 0.2

13.9 Summary

  • Overfitting: Model memorizes, doesn’t generalize
  • Dropout: Randomly drop neurons during training
  • Weight Decay: Penalize large weights
  • Early Stopping: Stop when validation loss stops improving
  • train/eval mode: Different behavior during training vs inference

Next: Normalization for stable training.