14 Normalization
Deep networks suffer from internal covariate shift. Normalization stabilizes training.
14.1 The Problem
As training progresses, the distribution of layer inputs changes. Each layer must constantly adapt to new input distributions.
Internal Covariate Shift: The change in distribution of layer inputs during training.
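To make this concrete, here is a minimal numpy sketch (layer sizes and the "update" are made up for illustration): a single change to a layer's weights shifts the statistics of its outputs, which are exactly the inputs the next layer sees.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(256, 64)              # a batch of inputs to some hidden layer
W = np.random.randn(64, 64) * 0.5         # that layer's weights

h = x @ W                                 # this layer's output = next layer's input
print(h[:, 0].mean(), h[:, 0].std())      # statistics of one unit's activations

W += 0.1 * np.random.randn(64, 64)        # pretend a gradient step changed the weights
h = x @ W
print(h[:, 0].mean(), h[:, 0].std())      # the distribution the next layer sees has drifted
```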
14.2 Layer Normalization
Normalize across features (used in Transformers):
\[\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]
Where:
- \(\mu\) = mean across features
- \(\sigma^2\) = variance across features
- \(\gamma, \beta\) = learnable parameters
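Before wrapping this in a layer, a quick numerical check of the formula (a minimal numpy sketch with \(\gamma = 1\), \(\beta = 0\), and arbitrary shapes): each row comes out with mean ≈ 0 and variance ≈ 1, regardless of the input's original scale and offset.

```python
import numpy as np

x = np.random.randn(2, 8) * 3.0 + 5.0          # two samples, 8 features each
mu = x.mean(axis=-1, keepdims=True)            # per-sample mean across features
var = x.var(axis=-1, keepdims=True)            # per-sample variance across features
x_norm = (x - mu) / np.sqrt(var + 1e-5)        # gamma = 1, beta = 0

print(x_norm.mean(axis=-1))   # ~[0, 0]
print(x_norm.std(axis=-1))    # ~[1, 1]
```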
```python
class LayerNorm:
    """Layer Normalization."""

    def __init__(self, normalized_shape, eps=1e-5):
        self.normalized_shape = normalized_shape
        self.eps = eps
        # Learnable parameters
        self.gamma = Tensor(np.ones(normalized_shape), requires_grad=True)
        self.beta = Tensor(np.zeros(normalized_shape), requires_grad=True)

    def __call__(self, x):
        # Compute mean and variance across last dimension(s)
        mean = x.data.mean(axis=-1, keepdims=True)
        var = x.data.var(axis=-1, keepdims=True)
        # Normalize
        x_norm = (x.data - mean) / np.sqrt(var + self.eps)
        # Scale and shift
        out = self.gamma.data * x_norm + self.beta.data
        result = Tensor(out, requires_grad=x.requires_grad)
        if x.requires_grad:
            result.grad_fn = 'layernorm'
            result.parents = [x, self.gamma, self.beta]
            result._ln_cache = (x_norm, mean, var)
        return result

    def parameters(self):
        return [self.gamma, self.beta]
```

Code Reference: See src/tensorweaver/layers/layer_norm.py for the full implementation.
14.3 RMSNorm
Simplified normalization (used in LLaMA):
\[\text{RMSNorm}(x) = \gamma \cdot \frac{x}{\sqrt{\frac{1}{n}\sum x_i^2 + \epsilon}}\]
```python
class RMSNorm:
    """Root Mean Square Normalization."""

    def __init__(self, dim, eps=1e-6):
        self.eps = eps
        self.weight = Tensor(np.ones(dim), requires_grad=True)

    def __call__(self, x):
        # RMS = sqrt(mean(x^2))
        rms = np.sqrt((x.data ** 2).mean(axis=-1, keepdims=True) + self.eps)
        # Normalize and scale
        out = x.data / rms * self.weight.data
        return Tensor(out, requires_grad=x.requires_grad)

    def parameters(self):
        return [self.weight]
```

Code Reference: See src/tensorweaver/layers/rms_norm.py for the implementation.
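A quick numpy check of the same computation (with the weight left at its initial value of 1): after dividing by the root mean square, each row has RMS ≈ 1. Unlike LayerNorm, the mean is not subtracted, so the per-row mean is generally not zero.

```python
import numpy as np

x = np.random.randn(4, 8) * 2.0
rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-6)
out = x / rms                                  # weight = 1

print(np.sqrt((out ** 2).mean(axis=-1)))       # ~[1, 1, 1, 1]
print(out.mean(axis=-1))                       # not necessarily ~0
```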
14.4 Why RMSNorm?
| | LayerNorm | RMSNorm |
|---|---|---|
| Computation | mean and variance | only mean of squares |
| Parameters | γ, β | only γ |
| Speed | Slower | Faster |
| Performance | Slightly better | Nearly as good |
RMSNorm is ~10% faster with similar quality.
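One way to see why dropping the mean subtraction changes so little: when the input already has zero mean across features, the variance equals the mean of squares, so the two normalizations coincide (given the same ε, γ = 1, β = 0). A small numpy sketch:

```python
import numpy as np

x = np.random.randn(4, 8)
x = x - x.mean(axis=-1, keepdims=True)   # force zero mean per row

ln = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + 1e-6)
rms = x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-6)

print(np.allclose(ln, rms))   # True: with zero-mean inputs the two are identical
```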
14.5 Batch Normalization (for reference)
Normalize across the batch (common in CNNs):
\[\text{BatchNorm}(x) = \gamma \cdot \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta\]
Where \(\mu_B, \sigma_B^2\) are computed across the batch.
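The key difference from LayerNorm is the axis: statistics are taken over the batch dimension (axis 0), producing one mean and variance per feature. A minimal numpy sketch of just that computation (shapes chosen for illustration, γ = 1, β = 0):

```python
import numpy as np

x = np.random.randn(32, 8) * 2.0 + 3.0        # batch of 32 samples, 8 features
mu_B = x.mean(axis=0)                          # one mean per feature, shape (8,)
var_B = x.var(axis=0)                          # one variance per feature, shape (8,)
x_norm = (x - mu_B) / np.sqrt(var_B + 1e-5)

print(x_norm.mean(axis=0).round(3))            # ~zeros: each feature is centered
print(x_norm.std(axis=0).round(3))             # ~ones: each feature has unit scale
```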
```python
class BatchNorm:
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.gamma = Tensor(np.ones(num_features), requires_grad=True)
        self.beta = Tensor(np.zeros(num_features), requires_grad=True)
        # Running statistics for inference
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.training = True

    def __call__(self, x):
        if self.training:
            mean = x.data.mean(axis=0)
            var = x.data.var(axis=0)
            # Update running statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + \
                                self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + \
                               self.momentum * var
        else:
            mean = self.running_mean
            var = self.running_var
        x_norm = (x.data - mean) / np.sqrt(var + self.eps)
        out = self.gamma.data * x_norm + self.beta.data
        return Tensor(out, requires_grad=x.requires_grad)
```

14.6 When to Use What
| Normalization | Best For |
|---|---|
| LayerNorm | Transformers, RNNs |
| RMSNorm | Modern LLMs (LLaMA, etc.) |
| BatchNorm | CNNs, large batches |
14.7 Using LayerNorm
```python
import numpy as np

from tensorweaver import Tensor
from tensorweaver.nn.functional import relu
from tensorweaver.layers import LayerNorm
from tensorweaver.optim import Adam

# Parameters
W1 = Tensor(np.random.randn(4, 8) * 0.5, requires_grad=True)
b1 = Tensor(np.zeros(8), requires_grad=True)
ln = LayerNorm(8)  # Normalize 8 features
W2 = Tensor(np.random.randn(8, 3) * 0.5, requires_grad=True)
b2 = Tensor(np.zeros(3), requires_grad=True)

def forward(x):
    h = x @ W1 + b1
    h = ln(h)      # LayerNorm before activation
    h = relu(h)
    out = h @ W2 + b2
    return out

# Include LayerNorm parameters in optimizer
all_params = [W1, b1, W2, b2] + ln.parameters()
optimizer = Adam(all_params, lr=0.01)
```

14.8 Pre-Norm vs Post-Norm
Post-Norm (original Transformer):

```python
x = x + attention(x)
x = layernorm(x)
```

Pre-Norm (modern practice):

```python
x = x + attention(layernorm(x))
```

Pre-Norm is more stable for deep networks.
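The stability comes from the residual path: in Pre-Norm the normalization sits inside the branch, so the skip connection carries the signal through unchanged and gradients flow cleanly even in very deep stacks. A minimal sketch of a full pre-norm block (attention, feed_forward, ln1, and ln2 are assumed to exist; this is illustrative, not TensorWeaver's API):

```python
def prenorm_block(x, attention, feed_forward, ln1, ln2):
    # Each sub-layer sees normalized input; the residual path stays untouched.
    x = x + attention(ln1(x))
    x = x + feed_forward(ln2(x))
    return x
```

Because the residual stream itself is never normalized inside a pre-norm block, models such as GPT-2 and LLaMA apply one final normalization after the last block.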
14.9 Summary
- Normalization stabilizes training by normalizing activations
- LayerNorm: Normalize across features (Transformers)
- RMSNorm: Faster, simpler (modern LLMs)
- BatchNorm: Normalize across batch (CNNs)
- Pre-Norm is preferred for deep networks
Next: Putting it all together with an MLP for Iris classification.