flowchart TD
Input[Input] --> LN1[LayerNorm]
LN1 --> Attn[Multi-Head Attention]
Attn --> Add1[Add]
Input --> Add1
Add1 --> LN2[LayerNorm]
LN2 --> FFN[Feed-Forward Network]
FFN --> Add2[Add]
Add1 --> Add2
Add2 --> Output[Output]
29 The Transformer Block
The fundamental building block of GPT. Stack enough of these blocks and you have the backbone of any Transformer.
29.1 Block Architecture
Key components:

1. Multi-head attention with a causal mask
2. Feed-forward network (MLP)
3. Layer normalization
4. Residual connections
29.2 Feed-Forward Network
A simple two-layer MLP with a hidden-dimension expansion:
class FeedForward(Module):
    """Position-wise feed-forward network."""

    def __init__(self, d_model, d_ff=None, dropout=0.0):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # Standard: 4x expansion
        self.fc1 = Linear(d_model, d_ff)
        self.fc2 = Linear(d_ff, d_model)
        self.dropout = Dropout(dropout)

    def forward(self, x):
        x = self.fc1(x)
        x = gelu(x)  # GELU activation (GPT-2 uses this)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

Code Reference: See src/tensorweaver/layers/mlp.py for the implementation.
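A quick shape check (a sketch, assuming the NumPy-backed Tensor used in the test at the end of this chapter): because the FFN acts on each position independently, the batch and sequence dimensions pass through unchanged.

ffn = FeedForward(d_model=768)           # d_ff defaults to 4 * 768 = 3072
x = Tensor(np.random.randn(2, 32, 768))  # (batch, seq, d_model)
y = ffn(x)
print(y.shape)  # (2, 32, 768): expanded to 3072 inside, projected back to 768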
29.3 Pre-Norm vs Post-Norm
Original Transformer (Post-Norm):
x = layer_norm(x + attention(x))

GPT-2 and Modern (Pre-Norm):

x = x + attention(layer_norm(x))

Pre-norm is more stable for deep networks: the residual path is never normalized, so gradients can flow straight from the output back to the input.
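To make the difference concrete, here is a minimal sketch of the attention sub-block under each convention (dropout omitted; attn and ln stand for the attention and LayerNorm modules used in the block below):

def post_norm_sublayer(x, attn, ln):
    # Original Transformer: normalize *after* the residual add,
    # so even the identity path passes through LayerNorm.
    return ln(x + attn(x))

def pre_norm_sublayer(x, attn, ln):
    # GPT-2 style: normalize *before* the sublayer; the residual path
    # carries x through completely unchanged.
    return x + attn(ln(x))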
29.4 Complete Transformer Block
class TransformerBlock(Module):
    """A single Transformer block."""

    def __init__(self, d_model, num_heads, d_ff=None, dropout=0.0):
        super().__init__()
        # Attention
        self.ln1 = LayerNorm(d_model)
        self.attn = CausalMultiHeadAttention(d_model, num_heads)
        self.dropout1 = Dropout(dropout)
        # Feed-forward
        self.ln2 = LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.dropout2 = Dropout(dropout)

    def forward(self, x):
        # Attention block with residual
        attn_out = self.attn(self.ln1(x))
        x = x + self.dropout1(attn_out)
        # FFN block with residual
        ffn_out = self.ffn(self.ln2(x))
        x = x + self.dropout2(ffn_out)
        return x

Code Reference: See src/tensorweaver/layers/transformer_block.py
29.5 Why Residual Connections?
Without residuals, deep networks suffer from:

- Vanishing gradients: gradients shrink through layers
- Optimization difficulty: hard to train
Residuals provide a “gradient highway”:
# Without residual: gradient must flow through layer
y = layer(x)
# With residual: gradient can skip layer
y = x + layer(x)
# dy/dx = 1 + dlayer/dx ← the identity term keeps the gradient from vanishing
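A toy numeric illustration of that highway (plain NumPy, independent of TensorWeaver; for simplicity every layer's derivative is evaluated at the same point):

import numpy as np

def layer_grad(x):
    # Derivative of a toy layer f(x) = 0.5 * tanh(x); always below 0.5.
    return 0.5 * (1.0 - np.tanh(x) ** 2)

x = 0.3
without_residual, with_residual = 1.0, 1.0
for _ in range(20):                           # chain rule through 20 layers
    without_residual *= layer_grad(x)         # y = f(x):     dy/dx = f'(x)
    with_residual *= 1.0 + layer_grad(x)      # y = x + f(x): dy/dx = 1 + f'(x)

print(f"no residuals:   {without_residual:.1e}")  # ~2e-07, effectively vanished
print(f"with residuals: {with_residual:.1e}")     # stays far above zero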
29.6 Why Layer Normalization?

Normalizes activations to prevent:

- Exploding values
- Vanishing values
- Internal covariate shift
# LayerNorm normalizes across features
x_norm = (x - mean) / std
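Spelled out slightly more fully (a plain-NumPy sketch; gamma, beta, and eps are the usual learnable scale, learnable shift, and numerical-stability constant, not TensorWeaver names):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature dimension, separately for every position.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta  # learnable scale and shift

x = 100.0 * np.random.randn(2, 32, 768)     # deliberately large activations
y = layer_norm(x, np.ones(768), np.zeros(768))
print(round(float(y.mean()), 3), round(float(y.std()), 3))  # ~0.0 and ~1.0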
29.7 Stacking Blocks

GPT stacks multiple blocks:
class TransformerStack(Module):
    """Stack of Transformer blocks."""

    def __init__(self, num_layers, d_model, num_heads, d_ff=None, dropout=0.0):
        super().__init__()
        self.layers = ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

29.8 GPT-2 Configuration
| Model | Layers | d_model | Heads | d_ff | Parameters |
|---|---|---|---|---|---|
| Small | 12 | 768 | 12 | 3072 | 124M |
| Medium | 24 | 1024 | 16 | 4096 | 355M |
| Large | 36 | 1280 | 20 | 5120 | 774M |
| XL | 48 | 1600 | 25 | 6400 | 1.5B |
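Each row maps directly onto TransformerStack. A sketch for the Small configuration (assuming the same Tensor/NumPy setup as the test in 29.10):

# GPT-2 small: 12 layers, d_model=768, 12 heads, d_ff=3072
gpt2_small_blocks = TransformerStack(
    num_layers=12,
    d_model=768,
    num_heads=12,
    d_ff=3072,
)

x = Tensor(np.random.randn(2, 32, 768))  # (batch, seq, d_model)
y = gpt2_small_blocks(x)
print(y.shape)  # (2, 32, 768): stacking never changes the shape

These blocks account for roughly 85M of GPT-2 small's 124M parameters (see 29.9); the embedding layers of the full model make up most of the rest.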
29.9 Parameter Count Per Block
For GPT-2 small (d_model=768), counting weight matrices and LayerNorm parameters (Linear biases would add only a few thousand more):

Attention:
  W_qkv: 768 × 2304 = 1,769,472   (2304 = 3 × 768 for Q, K, V)
  W_out: 768 × 768 = 589,824
FFN:
  fc1: 768 × 3072 = 2,359,296
  fc2: 3072 × 768 = 2,359,296
LayerNorm (2x):
  2 × 768 × 2 = 3,072

Total per block: ~7.08M parameters
12 blocks: ~85M parameters
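These numbers are easy to reproduce. A small sketch (plain Python, counting weights and LayerNorm parameters as above, biases ignored) that recomputes the totals from d_model:

def block_params(d_model, d_ff=None):
    d_ff = d_ff or 4 * d_model
    attn = d_model * (3 * d_model) + d_model * d_model  # W_qkv + W_out
    ffn = d_model * d_ff + d_ff * d_model               # fc1 + fc2
    ln = 2 * (2 * d_model)                              # two LayerNorms, scale + shift each
    return attn + ffn + ln

print(f"{block_params(768):,}")       # 7,080,960  (~7.08M per block)
print(f"{12 * block_params(768):,}")  # 84,971,520 (~85M across 12 blocks)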
29.10 Testing the Block
# Create block
block = TransformerBlock(
    d_model=768,
    num_heads=12,
    d_ff=3072,
    dropout=0.1
)

# Test forward pass
x = Tensor(np.random.randn(2, 32, 768))  # (batch, seq, d_model)
y = block(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")
# Both: (2, 32, 768)

print(f"Parameters: {sum(p.data.size for p in block.parameters()):,}")

29.11 Summary
Transformer block = Attention + FFN + LayerNorm + Residuals
- Attention: Captures relationships between positions
- FFN: Processes each position independently
- LayerNorm: Stabilizes training
- Residuals: Enable gradient flow
Next: Putting it all together — the complete GPT model.