29  The Transformer Block

The fundamental building block of GPT. Stack these to build any Transformer.

29.1 Block Architecture

flowchart TD
    Input[Input] --> LN1[LayerNorm]
    LN1 --> Attn[Multi-Head Attention]
    Attn --> Add1[Add]
    Input --> Add1

    Add1 --> LN2[LayerNorm]
    LN2 --> FFN[Feed-Forward Network]
    FFN --> Add2[Add]
    Add1 --> Add2

    Add2 --> Output[Output]

Key components:

  1. Multi-head attention with causal mask
  2. Feed-forward network (MLP)
  3. Layer normalization
  4. Residual connections
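
To connect the flowchart to code, here is a minimal NumPy sketch of the same wiring, with toy stand-ins for the attention and FFN boxes (the real versions appear below); only the data flow is the point.

import numpy as np

def layer_norm(x):                        # LN1 / LN2: normalize each feature vector
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def attention(x):                         # toy stand-in for multi-head attention
    return 0.1 * x

def ffn(x):                               # toy stand-in for the feed-forward network
    return np.tanh(x)

x = np.random.randn(2, 32, 768)           # (batch, seq, d_model)
x = x + attention(layer_norm(x))          # LN1 -> Attn -> Add1 (residual from Input)
x = x + ffn(layer_norm(x))                # LN2 -> FFN  -> Add2 (residual from Add1)
print(x.shape)                            # (2, 32, 768): shape is preserved, so blocks stack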

29.2 Feed-Forward Network

Simple two-layer MLP with expansion:

class FeedForward(Module):
    """Position-wise feed-forward network."""

    def __init__(self, d_model, d_ff=None, dropout=0.0):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # Standard: 4x expansion

        self.fc1 = Linear(d_model, d_ff)
        self.fc2 = Linear(d_ff, d_model)
        self.dropout = Dropout(dropout)

    def forward(self, x):
        x = self.fc1(x)
        x = gelu(x)  # GELU activation (GPT-2 uses this)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
Note

Code Reference: See src/tensorweaver/layers/mlp.py for the implementation.
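
The framework's gelu is in the code reference above; for intuition, the tanh approximation used by the original GPT-2 code can be sketched in a few lines of NumPy (names here are just for illustration):

import numpy as np

def gelu(x):
    # GPT-2's tanh approximation of the Gaussian Error Linear Unit.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

print(gelu(np.array([-2.0, -1.0, 0.0, 1.0, 2.0])))
# ≈ [-0.0454 -0.1588  0.      0.8412  1.9546]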

29.3 Pre-Norm vs Post-Norm

Original Transformer (Post-Norm):

x = layer_norm(x + attention(x))

GPT-2 and Modern (Pre-Norm):

x = x + attention(layer_norm(x))

Pre-norm is more stable for deep networks: the residual path carries activations (and gradients) through every block untouched, instead of passing them through a LayerNorm at each step. GPT-2 uses pre-norm, and so does the block below.
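
A quick way to see the practical difference is to run both orderings on the same input with a toy sublayer and compare output statistics: post-norm re-normalizes the residual stream in every block, while pre-norm leaves the skip path untouched. A minimal, self-contained NumPy sketch:

import numpy as np

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def sublayer(x):                      # toy stand-in for attention or the FFN
    return np.tanh(x)

x = 5.0 * np.random.randn(2, 32, 768)

post = layer_norm(x + sublayer(x))    # original Transformer ordering
pre  = x + sublayer(layer_norm(x))    # GPT-2 ordering

print(post.std())   # ≈ 1: the residual stream is re-normalized in every block
print(pre.std())    # well above 1: the skip path passes through unchanged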

29.4 Complete Transformer Block

class TransformerBlock(Module):
    """A single Transformer block."""

    def __init__(self, d_model, num_heads, d_ff=None, dropout=0.0):
        super().__init__()

        # Attention
        self.ln1 = LayerNorm(d_model)
        self.attn = CausalMultiHeadAttention(d_model, num_heads)
        self.dropout1 = Dropout(dropout)

        # Feed-forward
        self.ln2 = LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.dropout2 = Dropout(dropout)

    def forward(self, x):
        # Attention block with residual
        attn_out = self.attn(self.ln1(x))
        x = x + self.dropout1(attn_out)

        # FFN block with residual
        ffn_out = self.ffn(self.ln2(x))
        x = x + self.dropout2(ffn_out)

        return x
Note

Code Reference: See src/tensorweaver/layers/transformer_block.py

29.5 Why Residual Connections?

Without residuals, deep networks suffer from:

  • Vanishing gradients: gradients shrink through layers
  • Optimization difficulty: hard to train

Residuals provide a “gradient highway”:

# Without residual: gradient must flow through layer
y = layer(x)

# With residual: gradient can skip layer
y = x + layer(x)
# dy/dx = 1 + dlayer/dx  ← the identity term gives the gradient a direct path back
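
A scalar back-of-the-envelope check makes this concrete: treat each layer's local derivative as a number below 1 (typical with saturating activations), multiply 48 of them as the chain rule does, and compare with the residual case where every factor becomes 1 + derivative.

import numpy as np

np.random.seed(0)
local = np.random.uniform(0.1, 0.9, size=48)   # per-layer local derivatives, 48 layers deep

plain    = np.prod(local)          # no skips: product of small factors
residual = np.prod(1.0 + local)    # with skips: each factor is 1 + d(layer)/dx

print(f"plain:    {plain:.2e}")    # effectively zero: the gradient has vanished
print(f"residual: {residual:.2e}") # far above 1: the identity terms keep the signal alive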

29.6 Why Layer Normalization?

Normalizes activations to prevent:

  • Exploding values
  • Vanishing values
  • Internal covariate shift

# LayerNorm normalizes across the feature dimension,
# then applies a learned per-feature scale and shift
x_norm = (x - mean) / std
y = gamma * x_norm + beta
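
A minimal NumPy sketch of the full computation; the framework's LayerNorm also learns the per-feature scale (gamma) and shift (beta), as the 2 × 768 parameters per LayerNorm counted in 29.9 suggest.

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's feature vector, then rescale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = 10.0 * np.random.randn(2, 32, 768) + 5.0      # badly scaled activations
y = layer_norm(x, gamma=np.ones(768), beta=np.zeros(768))

print(y.mean(axis=-1)[0, :3])   # ≈ 0 for every position
print(y.std(axis=-1)[0, :3])    # ≈ 1 for every position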

29.7 Stacking Blocks

GPT stacks multiple blocks:

class TransformerStack(Module):
    """Stack of Transformer blocks."""

    def __init__(self, num_layers, d_model, num_heads, d_ff=None, dropout=0.0):
        super().__init__()
        self.layers = ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
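
Usage mirrors a single block. A quick sketch with GPT-2-small-sized settings, assuming NumPy and the framework's Tensor as in the test in 29.10:

# 12 blocks matches GPT-2 small; every block preserves (batch, seq, d_model).
stack = TransformerStack(num_layers=12, d_model=768, num_heads=12, dropout=0.1)

x = Tensor(np.random.randn(2, 32, 768))
y = stack(x)
print(y.shape)   # (2, 32, 768)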

29.8 GPT-2 Configuration

Model    Layers    d_model    Heads    d_ff    Parameters
Small    12        768        12       3072    124M
Medium   24        1024       16       4096    355M
Large    36        1280       20       5120    774M
XL       48        1600       25       6400    1.5B
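
For later chapters it can be handy to have the same numbers as plain Python. The dictionary name below is just for this sketch, not part of the framework:

# GPT-2 family sizes from the table above (parameter counts are for the full model).
GPT2_CONFIGS = {
    "small":  dict(num_layers=12, d_model=768,  num_heads=12, d_ff=3072),   # 124M
    "medium": dict(num_layers=24, d_model=1024, num_heads=16, d_ff=4096),   # 355M
    "large":  dict(num_layers=36, d_model=1280, num_heads=20, d_ff=5120),   # 774M
    "xl":     dict(num_layers=48, d_model=1600, num_heads=25, d_ff=6400),   # 1.5B
}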

29.9 Parameter Count Per Block

For GPT-2 small (d_model=768):

Attention:
  W_qkv: 768 × 2304 = 1,769,472
  W_out: 768 × 768  = 589,824

FFN:
  fc1: 768 × 3072 = 2,359,296
  fc2: 3072 × 768 = 2,359,296

LayerNorm (2x):
  2 × 768 × 2 = 3,072

Total per block: ~7.08M parameters (weight matrices only; bias terms add only a few thousand more)
12 blocks: ~85M parameters
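
The same arithmetic in a few lines of Python, so it can be re-derived for any configuration:

d_model, d_ff, num_layers = 768, 3072, 12

attn = d_model * (3 * d_model) + d_model * d_model   # W_qkv + W_out
ffn  = d_model * d_ff + d_ff * d_model               # fc1 + fc2
ln   = 2 * (2 * d_model)                             # two LayerNorms, scale + shift each

per_block = attn + ffn + ln
print(f"per block: {per_block:,}")                         # 7,080,960  (~7.08M)
print(f"{num_layers} blocks: {num_layers * per_block:,}")  # 84,971,520 (~85M)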

29.10 Testing the Block

import numpy as np

# Create block (uses TransformerBlock from above and the framework's Tensor)
block = TransformerBlock(
    d_model=768,
    num_heads=12,
    d_ff=3072,
    dropout=0.1
)

# Test forward pass
x = Tensor(np.random.randn(2, 32, 768))  # (batch, seq, d_model)
y = block(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")
# Both: (2, 32, 768)

print(f"Parameters: {sum(p.data.size for p in block.parameters()):,}")

29.11 Summary

Transformer block = Attention + FFN + LayerNorm + Residuals

  • Attention: Captures relationships between positions
  • FFN: Processes each position independently
  • LayerNorm: Stabilizes training
  • Residuals: Enable gradient flow

Next: Putting it all together — the complete GPT model.