13 Regularization
Deep networks can memorize training data. Regularization prevents overfitting.
13.1 The Overfitting Problem
Overfitting: the model memorizes the training data but fails to generalize to new data.

```mermaid
flowchart LR
    subgraph UF["Underfitting"]
        A[High train error, High test error]
    end
    subgraph GF["Good Fit"]
        B[Low train error, Low test error]
    end
    subgraph OF["Overfitting"]
        C[Low train error, HIGH test error]
    end
```
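To see the difference concretely, here is a minimal NumPy sketch (an illustration, not TensorWeaver code) that fits polynomials of increasing degree to a handful of noisy points; the high-degree model typically drives training error close to zero while its test error grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-D regression problem: y = sin(x) plus noise
x_train = np.linspace(0, 3, 15)
y_train = np.sin(x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0, 3, 100)
y_test = np.sin(x_test)

for degree in (1, 3, 8):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

In a run like this, the degree-1 line underfits (both errors high), a moderate degree fits well, and the high-degree polynomial chases the noise (low train error, high test error), matching the diagram above.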
13.2 Dropout
Randomly “drop” neurons during training:
```python
import numpy as np

from tensorweaver import Tensor


class Dropout:
    """Randomly zero out elements during training."""

    def __init__(self, p=0.5):
        """
        Args:
            p: Probability of dropping (0.5 = drop half)
        """
        self.p = p
        self.training = True

    def __call__(self, x):
        if not self.training or self.p == 0:
            return x

        # Create random mask: keep each element with probability (1 - p)
        mask = (np.random.rand(*x.shape) > self.p).astype(np.float32)

        # Scale by 1/(1-p) to maintain the expected value of the activations
        scale = 1.0 / (1 - self.p)
        result_data = x.data * mask * scale

        result = Tensor(result_data, requires_grad=x.requires_grad)
        if x.requires_grad:
            result.grad_fn = 'dropout'
            result.parents = [x]
            result._dropout_mask = mask * scale
        return result
```
Note
Code Reference: See src/tensorweaver/layers/dropout.py for the implementation.
13.2.1 Why Dropout Works
- Prevents co-adaptation: neurons can’t rely on the presence of specific other neurons, so each must learn independently useful features
- Ensemble effect: every training step samples a different “sub-network”, so training approximates averaging over many thinned networks
- Noise injection: the random masking acts as regularization noise, while the rescaling keeps activations unbiased (see the check below)
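The scaling by 1/(1 - p) is what makes training and inference consistent: on average, the masked-and-scaled activations equal the original ones, so nothing needs to change at inference time. Here is a quick NumPy check (an illustrative sketch, not TensorWeaver code):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                               # drop probability
x = np.ones((1000, 100))              # constant activations make the average easy to read

total = np.zeros_like(x)
num_samples = 200
for _ in range(num_samples):
    mask = (rng.random(x.shape) > p).astype(np.float32)
    total += x * mask / (1 - p)       # inverted dropout: mask, then rescale

mean_activation = (total / num_samples).mean()
print(f"mean activation with dropout: {mean_activation:.3f} (original: {x.mean():.3f})")
# Prints a value very close to 1.000
```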
13.2.2 Dropout: Training vs Inference
```python
dropout = Dropout(p=0.5)

# Training: randomly drop
dropout.training = True
out = dropout(x)   # Some values become 0

# Inference: keep everything
dropout.training = False
out = dropout(x)   # All values preserved
```
13.3 Weight Decay (L2 Regularization)
Penalize large weights:
\[\mathcal{L}_{total} = \mathcal{L}_{data} + \lambda \sum_i w_i^2\]
This discourages the model from using very large weights.
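Taking the gradient of the penalty term shows where the weight update comes from:

\[\frac{\partial}{\partial w_i}\left(\lambda \sum_j w_j^2\right) = 2 \lambda w_i\]

so every step shrinks each weight toward zero in proportion to its size (the factor of 2 is usually folded into λ). As a minimal sketch with stand-in NumPy arrays (not TensorWeaver's optimizer API), the classic L2 form adds this term to the gradient of the data loss:

```python
import numpy as np

lr, lam = 0.01, 1e-4
w = np.random.randn(8, 3)            # a weight matrix
grad_data = np.random.randn(8, 3)    # stand-in for the gradient of the data loss

# L2 regularization: the penalty gradient is added to the data gradient
grad_total = grad_data + 2 * lam * w
w -= lr * grad_total
```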
In AdamW (Chapter 10) we implemented the decoupled form, which applies this shrinkage directly to the weights instead of folding it into the gradient:

```python
# Decoupled weight decay (part of the AdamW update)
param.data -= lr * weight_decay * param.data
```
13.4 Early Stopping
Stop training when validation loss stops improving:
```python
best_val_loss = float('inf')
patience = 10            # epochs to wait for an improvement
patience_counter = 0

for epoch in range(max_epochs):
    # Training...
    train_loss = train_one_epoch()

    # Validation
    val_loss = evaluate(val_data)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        save_model()     # Save best model
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            break

load_model()             # Restore best model
```
13.5 Data Augmentation
Create variations of training data:
```python
def augment_tabular(x, noise_std=0.1):
    """Add small noise to tabular data."""
    noise = np.random.randn(*x.shape) * noise_std
    return x + noise
```
For images: rotations, flips, crops, and color changes. For text: synonym replacement and back-translation.
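As a concrete image example, here is a minimal sketch of a random horizontal flip (assuming images are stored as NumPy arrays with shape `(H, W, C)`; the helper name is illustrative, not part of TensorWeaver):

```python
import numpy as np

def random_horizontal_flip(image, p=0.5):
    """Flip an (H, W, C) image left-to-right with probability p."""
    if np.random.rand() < p:
        return image[:, ::-1, :]   # reverse the width axis
    return image

# Example: a fake 4x4 RGB "image"
img = np.arange(4 * 4 * 3).reshape(4, 4, 3)
augmented = random_horizontal_flip(img)
```

The label stays the same after such a transformation, which is what lets augmentation act as extra training data.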
13.6 Regularization Summary
| Technique | What It Does | When to Use |
|---|---|---|
| Dropout | Randomly drops neurons | Hidden layers |
| Weight Decay | Penalizes large weights | Always (small λ) |
| Early Stopping | Stops before overfitting | Always |
| Data Augmentation | Creates more training data | When data is limited |
13.7 Using Dropout in a Network
```python
import numpy as np

from tensorweaver import Tensor
from tensorweaver.nn.functional import relu
from tensorweaver.layers import Dropout
from tensorweaver.optim import Adam

# Model with dropout
W1 = Tensor(np.random.randn(4, 8) * 0.5, requires_grad=True)
b1 = Tensor(np.zeros(8), requires_grad=True)
W2 = Tensor(np.random.randn(8, 3) * 0.5, requires_grad=True)
b2 = Tensor(np.zeros(3), requires_grad=True)
dropout = Dropout(p=0.3)

def forward(x, training=True):
    dropout.training = training
    h = relu(x @ W1 + b1)
    h = dropout(h)      # Apply dropout to the hidden activations
    out = h @ W2 + b2
    return out

optimizer = Adam([W1, b1, W2, b2], lr=0.01)

# Training (assumes x_train, y_train, x_test, compute_loss, and epochs are defined)
for epoch in range(epochs):
    out = forward(x_train, training=True)    # Dropout ON
    loss = compute_loss(out, y_train)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Evaluation
dropout.training = False
out = forward(x_test, training=False)        # Dropout OFF
```
13.8 Common Dropout Values
| Layer Type | Dropout Rate |
|---|---|
| Input layer | 0.0 - 0.2 |
| Hidden layers | 0.3 - 0.5 |
| Before output | 0.0 - 0.2 |
13.9 Summary
- Overfitting: Model memorizes, doesn’t generalize
- Dropout: Randomly drop neurons during training
- Weight Decay: Penalize large weights
- Early Stopping: Stop when validation loss stops improving
- train/eval mode: Different behavior during training vs inference
Next: Normalization for stable training.