14  Normalization

In deep networks, the distributions of intermediate activations drift as earlier layers update, which destabilizes training. Normalization layers counteract this drift and stabilize training.

14.1 The Problem

As training progresses, the distribution of layer inputs changes. Each layer must constantly adapt to new input distributions.

Internal Covariate Shift: The change in distribution of layer inputs during training.
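The effect is easy to demonstrate with a small plain-NumPy sketch (not part of TensorWeaver): even with a fixed input batch, updating a layer's weights changes the statistics of the pre-activations that the next layer sees.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 4))     # fixed input batch
W = rng.normal(size=(4, 8)) * 0.5  # layer weights

h_before = x @ W                   # pre-activations before the update
W_updated = W * 1.5                # simulate a parameter update that scales the weights
h_after = x @ W_updated            # same inputs, shifted distribution

# The next layer's input statistics have changed even though x is fixed
print("std before:", h_before.std())
print("std after: ", h_after.std())
```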

14.2 Layer Normalization

Normalize across features (used in Transformers):

\[\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]

Where:

  • \(\mu\) = mean across features
  • \(\sigma^2\) = variance across features
  • \(\gamma, \beta\) = learnable parameters

class LayerNorm:
    """Layer Normalization."""

    def __init__(self, normalized_shape, eps=1e-5):
        self.normalized_shape = normalized_shape
        self.eps = eps

        # Learnable parameters
        self.gamma = Tensor(np.ones(normalized_shape), requires_grad=True)
        self.beta = Tensor(np.zeros(normalized_shape), requires_grad=True)

    def __call__(self, x):
        # Compute mean and variance across last dimension(s)
        mean = x.data.mean(axis=-1, keepdims=True)
        var = x.data.var(axis=-1, keepdims=True)

        # Normalize
        x_norm = (x.data - mean) / np.sqrt(var + self.eps)

        # Scale and shift
        out = self.gamma.data * x_norm + self.beta.data

        result = Tensor(out, requires_grad=x.requires_grad)
        if x.requires_grad:
            result.grad_fn = 'layernorm'
            result.parents = [x, self.gamma, self.beta]
            result._ln_cache = (x_norm, mean, var)

        return result

    def parameters(self):
        return [self.gamma, self.beta]

Note

Code Reference: See src/tensorweaver/layers/layer_norm.py for the full implementation.
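A quick plain-NumPy sanity check (independent of the class above) confirms what the formula guarantees: before γ and β are applied, each row of the output has mean ≈ 0 and standard deviation ≈ 1, regardless of the input's scale and offset.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row across its features (gamma=1, beta=0)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Inputs with a large offset and scale
x = np.random.default_rng(1).normal(loc=3.0, scale=10.0, size=(4, 8))
y = layer_norm(x)
print(y.mean(axis=-1))  # each ≈ 0
print(y.std(axis=-1))   # each ≈ 1
```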

14.3 RMSNorm

Simplified normalization (used in LLaMA):

\[\text{RMSNorm}(x) = \gamma \cdot \frac{x}{\sqrt{\frac{1}{n}\sum x_i^2 + \epsilon}}\]

class RMSNorm:
    """Root Mean Square Normalization."""

    def __init__(self, dim, eps=1e-6):
        self.eps = eps
        self.weight = Tensor(np.ones(dim), requires_grad=True)

    def __call__(self, x):
        # RMS = sqrt(mean(x^2))
        rms = np.sqrt((x.data ** 2).mean(axis=-1, keepdims=True) + self.eps)

        # Normalize and scale
        out = x.data / rms * self.weight.data

        return Tensor(out, requires_grad=x.requires_grad)

    def parameters(self):
        return [self.weight]

Note

Code Reference: See src/tensorweaver/layers/rms_norm.py for the implementation.

14.4 Why RMSNorm?

              LayerNorm         RMSNorm
Computation   mean + var        only mean of squares
Parameters    γ, β              only γ
Speed         Slower            Faster
Performance   Slightly better   Nearly as good

In practice, RMSNorm is typically reported to be roughly 10% faster than LayerNorm with nearly identical model quality.
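The difference is visible in plain NumPy: RMSNorm skips the mean subtraction entirely, and before scaling by γ its output always has root-mean-square ≈ 1 per row. This is a standalone sketch, not the TensorWeaver implementation.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Divide by the root mean square of each row; no mean subtraction
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.random.default_rng(2).normal(size=(4, 8)) * 5.0
y = rms_norm(x)
print(np.sqrt((y ** 2).mean(axis=-1)))  # each ≈ 1
```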

14.5 Batch Normalization (for reference)

Normalize across the batch (common in CNNs):

\[\text{BatchNorm}(x) = \gamma \cdot \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta\]

Where \(\mu_B, \sigma_B^2\) are computed across the batch.

class BatchNorm:
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum

        self.gamma = Tensor(np.ones(num_features), requires_grad=True)
        self.beta = Tensor(np.zeros(num_features), requires_grad=True)

        # Running statistics for inference
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

        self.training = True

    def __call__(self, x):
        if self.training:
            mean = x.data.mean(axis=0)
            var = x.data.var(axis=0)

            # Update running statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + \
                                self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + \
                               self.momentum * var
        else:
            mean = self.running_mean
            var = self.running_var

        x_norm = (x.data - mean) / np.sqrt(var + self.eps)
        out = self.gamma.data * x_norm + self.beta.data

        return Tensor(out, requires_grad=x.requires_grad)

    def parameters(self):
        return [self.gamma, self.beta]
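BatchNorm's training/inference split can be illustrated in plain NumPy: with the same momentum convention as above, the exponential moving average pulls the running mean toward the true batch statistics, and inference then reuses the stored values instead of recomputing them per batch (a sketch, not the class above).

```python
import numpy as np

momentum = 0.1
running_mean = np.zeros(3)  # stored statistics, updated during training

rng = np.random.default_rng(3)
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=1.0, size=(32, 3))
    # Same EMA update rule as in the BatchNorm class
    running_mean = (1 - momentum) * running_mean + momentum * batch.mean(axis=0)

# After enough batches, the running mean converges to the data mean (≈ 5.0)
print(running_mean)
```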

14.6 When to Use What

Normalization   Best For
LayerNorm       Transformers, RNNs
RMSNorm         Modern LLMs (LLaMA, etc.)
BatchNorm       CNNs, large batches

14.7 Using LayerNorm

import numpy as np

from tensorweaver import Tensor
from tensorweaver.nn.functional import relu
from tensorweaver.layers import LayerNorm
from tensorweaver.optim import Adam

# Parameters
W1 = Tensor(np.random.randn(4, 8) * 0.5, requires_grad=True)
b1 = Tensor(np.zeros(8), requires_grad=True)
ln = LayerNorm(8)  # Normalize 8 features
W2 = Tensor(np.random.randn(8, 3) * 0.5, requires_grad=True)
b2 = Tensor(np.zeros(3), requires_grad=True)

def forward(x):
    h = x @ W1 + b1
    h = ln(h)         # LayerNorm before activation
    h = relu(h)
    out = h @ W2 + b2
    return out

# Include LayerNorm parameters in optimizer
all_params = [W1, b1, W2, b2] + ln.parameters()
optimizer = Adam(all_params, lr=0.01)

14.8 Pre-Norm vs Post-Norm

Post-Norm (original Transformer):

x = x + attention(x)
x = layernorm(x)

Pre-Norm (modern practice):

x = x + attention(layernorm(x))

Pre-Norm is more stable for deep networks.
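The two arrangements can be sketched in plain NumPy with a linear layer standing in for attention (the names and the stand-in transform are illustrative only; a real block would use attention and TensorWeaver tensors):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, W):
    # Normalize first, transform, then add the residual:
    # the residual path stays an unmodified identity
    return x + layer_norm(x) @ W

def post_norm_block(x, W):
    # Transform, add the residual, then normalize:
    # the residual path itself passes through the norm
    return layer_norm(x + x @ W)

rng = np.random.default_rng(4)
x = rng.normal(size=(2, 8))
W = rng.normal(size=(8, 8)) * 0.1

print(pre_norm_block(x, W).shape)   # (2, 8)
print(post_norm_block(x, W).shape)  # (2, 8)
```

Because the pre-norm residual path is a pure identity from input to output, gradients flow through deep stacks without passing through any normalization, which is one common explanation for its stability.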

14.9 Summary

  • Normalization stabilizes training by normalizing activations
  • LayerNorm: Normalize across features (Transformers)
  • RMSNorm: Faster, simpler (modern LLMs)
  • BatchNorm: Normalize across batch (CNNs)
  • Pre-Norm is preferred for deep networks

Next: Putting it all together with an MLP for Iris classification.