flowchart TD
Input[Input] --> LN1[LayerNorm]
LN1 --> Attn[Multi-Head Attention]
Attn --> Add1[Add]
Input --> Add1
Add1 --> LN2[LayerNorm]
LN2 --> FFN[Feed-Forward Network]
FFN --> Add2[Add]
Add1 --> Add2
Add2 --> Output[Output]
29 The Transformer Block
The fundamental building block of GPT. Stack enough of these blocks and you have the backbone of any Transformer.
29.1 Block Architecture
Key components:

1. Multi-head attention with a causal mask
2. Feed-forward network (MLP)
3. Layer normalization
4. Residual connections
29.2 Feed-Forward Network
A simple two-layer MLP with a hidden-dimension expansion:
class FeedForward(Module):
    """Position-wise feed-forward network."""

    def __init__(self, d_model, d_ff=None, dropout=0.0):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # Standard: 4x expansion
        self.fc1 = Linear(d_model, d_ff)
        self.fc2 = Linear(d_ff, d_model)
        self.dropout = Dropout(dropout)

    def forward(self, x):
        x = self.fc1(x)
        x = gelu(x)  # GELU activation (GPT-2 uses this)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

Code Reference: See src/tensorweaver/layers/mlp.py for the implementation.
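A quick shape check (a sketch, assuming the NumPy-backed Tensor used in the test at the end of this chapter): because the FFN acts on each position independently, the batch and sequence dimensions pass through unchanged.

ffn = FeedForward(d_model=768)           # d_ff defaults to 4 * 768 = 3072
x = Tensor(np.random.randn(2, 32, 768))  # (batch, seq, d_model)
y = ffn(x)
print(y.shape)  # (2, 32, 768): expanded to 3072 inside, projected back to 768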
29.3 Pre-Norm vs Post-Norm
Original Transformer (Post-Norm):
x = layer_norm(x + attention(x))

GPT-2 and Modern (Pre-Norm):

x = x + attention(layer_norm(x))

Pre-norm is more stable for deep networks: the residual path is never normalized, so gradients can flow straight from the output back to the input.
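To make the difference concrete, here is a minimal sketch of the attention sub-block under each convention (dropout omitted; attn and ln stand for the attention and LayerNorm modules used in the block below):

def post_norm_sublayer(x, attn, ln):
    # Original Transformer: normalize *after* the residual add,
    # so even the identity path passes through LayerNorm.
    return ln(x + attn(x))

def pre_norm_sublayer(x, attn, ln):
    # GPT-2 style: normalize *before* the sublayer; the residual path
    # carries x through completely unchanged.
    return x + attn(ln(x))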
29.4 Complete Transformer Block
class TransformerBlock(Module):
    """A single Transformer block."""

    def __init__(self, d_model, num_heads, d_ff=None, dropout=0.0):
        super().__init__()
        # Attention
        self.ln1 = LayerNorm(d_model)
        self.attn = CausalMultiHeadAttention(d_model, num_heads)
        self.dropout1 = Dropout(dropout)
        # Feed-forward
        self.ln2 = LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.dropout2 = Dropout(dropout)

    def forward(self, x):
        # Attention block with residual
        attn_out = self.attn(self.ln1(x))
        x = x + self.dropout1(attn_out)
        # FFN block with residual
        ffn_out = self.ffn(self.ln2(x))
        x = x + self.dropout2(ffn_out)
        return x

Code Reference: See src/tensorweaver/layers/transformer_block.py
29.5 Why Residual Connections?
Without residuals, deep networks suffer from:

- Vanishing gradients: gradients shrink through layers
- Optimization difficulty: hard to train
Residuals provide a “gradient highway”:
# Without residual: gradient must flow through layer
y = layer(x)
# With residual: gradient can skip layer
y = x + layer(x)
# dy/dx = 1 + dlayer/dx ← the identity term keeps the gradient from vanishing
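A toy numeric illustration of that highway (plain NumPy, independent of TensorWeaver; for simplicity every layer's derivative is evaluated at the same point):

import numpy as np

def layer_grad(x):
    # Derivative of a toy layer f(x) = 0.5 * tanh(x); always below 0.5.
    return 0.5 * (1.0 - np.tanh(x) ** 2)

x = 0.3
without_residual, with_residual = 1.0, 1.0
for _ in range(20):                           # chain rule through 20 layers
    without_residual *= layer_grad(x)         # y = f(x):     dy/dx = f'(x)
    with_residual *= 1.0 + layer_grad(x)      # y = x + f(x): dy/dx = 1 + f'(x)

print(f"no residuals:   {without_residual:.1e}")  # ~2e-07, effectively vanished
print(f"with residuals: {with_residual:.1e}")     # stays far above zero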
29.6 Why Layer Normalization?

Normalizes activations to prevent:

- Exploding values
- Vanishing values
- Internal covariate shift
# LayerNorm normalizes across features
x_norm = (x - mean) / std
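Spelled out slightly more fully (a plain-NumPy sketch; gamma, beta, and eps are the usual learnable scale, learnable shift, and numerical-stability constant, not TensorWeaver names):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature dimension, separately for every position.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta  # learnable scale and shift

x = 100.0 * np.random.randn(2, 32, 768)     # deliberately large activations
y = layer_norm(x, np.ones(768), np.zeros(768))
print(round(float(y.mean()), 3), round(float(y.std()), 3))  # ~0.0 and ~1.0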
29.7 Stacking Blocks

GPT stacks multiple blocks:
class TransformerStack(Module):
    """Stack of Transformer blocks."""

    def __init__(self, num_layers, d_model, num_heads, d_ff=None, dropout=0.0):
        super().__init__()
        self.layers = ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

29.8 GPT-2 Configuration
| Model | Layers | d_model | Heads | d_ff | Parameters |
|---|---|---|---|---|---|
| Small | 12 | 768 | 12 | 3072 | 124M |
| Medium | 24 | 1024 | 16 | 4096 | 355M |
| Large | 36 | 1280 | 20 | 5120 | 774M |
| XL | 48 | 1600 | 25 | 6400 | 1.5B |
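Each row maps directly onto TransformerStack. A sketch for the Small configuration (assuming the same Tensor/NumPy setup as the test in 29.10):

# GPT-2 small: 12 layers, d_model=768, 12 heads, d_ff=3072
gpt2_small_blocks = TransformerStack(
    num_layers=12,
    d_model=768,
    num_heads=12,
    d_ff=3072,
)

x = Tensor(np.random.randn(2, 32, 768))  # (batch, seq, d_model)
y = gpt2_small_blocks(x)
print(y.shape)  # (2, 32, 768): stacking never changes the shape

These blocks account for roughly 85M of GPT-2 small's 124M parameters (see 29.9); the embedding layers of the full model make up most of the rest.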
29.9 Parameter Count Per Block
For GPT-2 small (d_model=768), counting weight matrices and LayerNorm parameters (Linear biases would add only a few thousand more):

Attention:
  W_qkv: 768 × 2304 = 1,769,472   (2304 = 3 × 768 for Q, K, V)
  W_out: 768 × 768 = 589,824
FFN:
  fc1: 768 × 3072 = 2,359,296
  fc2: 3072 × 768 = 2,359,296
LayerNorm (2x):
  2 × 768 × 2 = 3,072

Total per block: ~7.08M parameters
12 blocks: ~85M parameters
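These numbers are easy to reproduce. A small sketch (plain Python, counting weights and LayerNorm parameters as above, biases ignored) that recomputes the totals from d_model:

def block_params(d_model, d_ff=None):
    d_ff = d_ff or 4 * d_model
    attn = d_model * (3 * d_model) + d_model * d_model  # W_qkv + W_out
    ffn = d_model * d_ff + d_ff * d_model               # fc1 + fc2
    ln = 2 * (2 * d_model)                              # two LayerNorms, scale + shift each
    return attn + ffn + ln

print(f"{block_params(768):,}")       # 7,080,960  (~7.08M per block)
print(f"{12 * block_params(768):,}")  # 84,971,520 (~85M across 12 blocks)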
29.10 Testing the Block
# Create block
block = TransformerBlock(
    d_model=768,
    num_heads=12,
    d_ff=3072,
    dropout=0.1
)

# Test forward pass
x = Tensor(np.random.randn(2, 32, 768))  # (batch, seq, d_model)
y = block(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")
# Both: (2, 32, 768)

print(f"Parameters: {sum(p.data.size for p in block.parameters()):,}")

29.11 Summary
Transformer block = Attention + FFN + LayerNorm + Residuals
- Attention: Captures relationships between positions
- FFN: Processes each position independently
- LayerNorm: Stabilizes training
- Residuals: Enable gradient flow
Next: Putting it all together — the complete GPT model.