30 Building GPT
The grand finale. Let’s build GPT from scratch.
30.1 GPT Architecture
flowchart TD
    Tokens[Token IDs] --> TokEmb[Token Embedding]
    Positions[Positions] --> PosEmb[Position Embedding]
    TokEmb --> Add[Add]
    PosEmb --> Add
    Add --> Drop[Dropout]
    Drop --> B1[Transformer Block 1]
    B1 --> B2[Transformer Block 2]
    B2 --> BN[...]
    BN --> BL[Transformer Block L]
    BL --> LN[Layer Norm]
    LN --> Head[LM Head]
    Head --> Logits[Logits]
30.2 GPT Configuration
from dataclasses import dataclass

@dataclass
class GPTConfig:
    """GPT model configuration."""
    vocab_size: int = 50257   # GPT-2 vocabulary
    block_size: int = 1024    # Max sequence length
    n_layer: int = 12         # Number of Transformer blocks
    n_head: int = 12          # Number of attention heads
    n_embd: int = 768         # Embedding dimension
    dropout: float = 0.1      # Dropout rate
30.3 The GPT Model
class GPT(Module):
    """GPT Language Model."""

    def __init__(self, config):
        super().__init__()
        self.config = config

        # Embeddings
        self.tok_emb = Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = Embedding(config.block_size, config.n_embd)
        self.drop = Dropout(config.dropout)

        # Transformer blocks
        self.blocks = ModuleList([
            TransformerBlock(
                d_model=config.n_embd,
                num_heads=config.n_head,
                d_ff=4 * config.n_embd,
                dropout=config.dropout
            )
            for _ in range(config.n_layer)
        ])

        # Final layer norm
        self.ln_f = LayerNorm(config.n_embd)

        # Language model head (predicts next token)
        self.lm_head = Linear(config.n_embd, config.vocab_size, bias=False)

        # Weight tying: share embedding and output weights
        self.lm_head.weight = self.tok_emb.weight

        print(f"GPT model with {self.num_parameters():,} parameters")

    def num_parameters(self):
        return sum(p.data.size for p in self.parameters())
Note
Code Reference: See src/tensorweaver/layers/gpt.py for the full implementation.
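With the default configuration and tied embeddings, the total should land near GPT-2 small’s roughly 124M parameters. The back-of-the-envelope sketch below assumes the TransformerBlock from earlier chapters uses biased Q/K/V/output projections, a two-layer FFN with biases, and LayerNorms with scale and shift; the exact figure printed by num_parameters() also depends on whether parameters() counts the tied lm_head weight once or twice.
cfg = GPTConfig()  # defaults: 12 layers, 12 heads, 768-dim, vocab 50257

tok_emb = cfg.vocab_size * cfg.n_embd                       # ~38.6M
pos_emb = cfg.block_size * cfg.n_embd                       # ~0.8M
attn    = 4 * (cfg.n_embd * cfg.n_embd + cfg.n_embd)        # Q/K/V/output projections + biases
ffn     = 2 * (cfg.n_embd * 4 * cfg.n_embd) + 4 * cfg.n_embd + cfg.n_embd  # two linears + biases
norms   = 2 * 2 * cfg.n_embd                                # two LayerNorms (scale + shift) per block
per_block = attn + ffn + norms

total = tok_emb + pos_emb + cfg.n_layer * per_block + 2 * cfg.n_embd  # + final LayerNorm
print(f"~{total / 1e6:.0f}M parameters")                    # ≈ 124M; lm_head adds nothing (tied)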
30.4 Forward Pass
class GPT(Module):
    # ... __init__ ...

    def forward(self, idx, targets=None):
        """
        Args:
            idx: Token indices (batch, seq_len)
            targets: Target tokens for loss computation (optional)

        Returns:
            logits: (batch, seq_len, vocab_size)
            loss: Cross-entropy loss (if targets provided)
        """
        batch, seq_len = idx.shape
        assert seq_len <= self.config.block_size

        # Token embeddings
        tok_emb = self.tok_emb(idx)  # (batch, seq, n_embd)

        # Position embeddings
        positions = Tensor(np.arange(seq_len))
        pos_emb = self.pos_emb(positions)  # (seq, n_embd)

        # Combine embeddings
        x = self.drop(tok_emb + pos_emb)

        # Transformer blocks
        for block in self.blocks:
            x = block(x)

        # Final layer norm
        x = self.ln_f(x)

        # Language model head
        logits = self.lm_head(x)  # (batch, seq, vocab_size)

        # Compute loss if targets provided
        loss = None
        if targets is not None:
            loss = cross_entropy_loss(
                logits.reshape(-1, logits.shape[-1]),
                targets.reshape(-1)
            )

        return logits, loss
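A quick sanity check on an untrained model: the logits should come out as (batch, seq_len, vocab_size), and the initial loss should sit near ln(50257) ≈ 10.8, since the untrained model is roughly uniform over the vocabulary. A minimal sketch, assuming GPTConfig and GPT as defined above (the tiny config is just to keep the check fast):
import numpy as np
from tensorweaver import Tensor

demo_config = GPTConfig(block_size=64, n_layer=2, n_head=2, n_embd=64)  # tiny, for a fast check
demo_model = GPT(demo_config)

idx = Tensor(np.random.randint(0, demo_config.vocab_size, size=(2, 16)))      # (batch=2, seq=16)
targets = Tensor(np.random.randint(0, demo_config.vocab_size, size=(2, 16)))

logits, loss = demo_model(idx, targets=targets)
print(logits.shape)   # expect (2, 16, 50257)
print(loss.data)      # expect roughly ln(50257) ≈ 10.8 before any training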
30.5 Weight Tying
Share weights between embedding and output:
self.lm_head.weight = self.tok_emb.weight
Why?
- Reduces parameters (~38M for GPT-2)
- Embedding and prediction are related tasks
- Empirically improves performance
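Because the two layers share one array, any update to the embedding also moves the output projection. A quick check, assuming the assignment above leaves both attributes pointing at the same object:
model = GPT(GPTConfig())
assert model.lm_head.weight is model.tok_emb.weight  # same object, not a copy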
30.6 Training GPT
import numpy as np
import tiktoken

from tensorweaver import Tensor
from tensorweaver.optim import AdamW

# Configuration
config = GPTConfig(
    vocab_size=50257,
    block_size=256,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.1
)

# Create model
model = GPT(config)

# Optimizer (AdamW is standard for Transformers)
optimizer = AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1
)

# Tokenizer
enc = tiktoken.get_encoding("gpt2")

# Prepare data
text = open("shakespeare.txt").read()
tokens = enc.encode(text)
data = Tensor(tokens)

def get_batch(batch_size, block_size):
    """Get a random batch of training data."""
    ix = np.random.randint(0, len(tokens) - block_size, batch_size)
    x = np.stack([tokens[i:i+block_size] for i in ix])
    y = np.stack([tokens[i+1:i+block_size+1] for i in ix])
    return Tensor(x), Tensor(y)

# Training loop
for step in range(1000):
    # Get batch
    x, y = get_batch(batch_size=32, block_size=config.block_size)

    # Forward
    logits, loss = model(x, targets=y)

    # Backward
    loss.backward()

    # Gradient clipping
    clip_grad_norm(model.parameters(), max_norm=1.0)

    # Update
    optimizer.step()
    optimizer.zero_grad()

    if step % 100 == 0:
        print(f"Step {step}: loss={loss.data:.4f}")
30.7 Gradient Clipping
Gradient clipping is essential for stable Transformer training. Occasional gradient spikes would otherwise produce oversized updates, so we rescale all gradients whenever their global norm exceeds a threshold:
def clip_grad_norm(parameters, max_norm):
    """Clip gradient norm to prevent explosion."""
    parameters = list(parameters)  # materialize, in case a generator was passed

    total_norm = 0
    for p in parameters:
        if p.grad is not None:
            total_norm += (p.grad ** 2).sum()
    total_norm = np.sqrt(total_norm)

    if total_norm > max_norm:
        scale = max_norm / total_norm
        for p in parameters:
            if p.grad is not None:
                p.grad *= scale

    return total_norm
30.8 Model Sizes
Create different GPT sizes:
def gpt_small():
    return GPT(GPTConfig(n_layer=12, n_head=12, n_embd=768))

def gpt_medium():
    return GPT(GPTConfig(n_layer=24, n_head=16, n_embd=1024))

def gpt_nano():
    """Tiny model for testing."""
    return GPT(GPTConfig(
        n_layer=4,
        n_head=4,
        n_embd=128,
        block_size=64
    ))
30.9 Summary
GPT consists of:
- Token embedding: Tokens → vectors
- Position embedding: Positions → vectors
- Transformer blocks: Attention + FFN
- Layer norm: Final normalization
- LM head: Vectors → vocabulary logits
Training requires:
- AdamW optimizer
- Gradient clipping
- Learning rate warmup (in production; a minimal sketch follows below)
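The warmup-plus-cosine-decay schedule used in production runs is not part of the loop above; here is a minimal sketch. The numbers are illustrative, and the final comment assumes the AdamW optimizer exposes a mutable lr attribute, which is an assumption about the TensorWeaver API rather than something shown in this chapter.
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, max_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps            # linear warmup
    if step >= max_steps:
        return min_lr                                        # floor after decay
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))      # goes from 1 down to 0
    return min_lr + cosine * (max_lr - min_lr)

# Inside the training loop, before optimizer.step():
#     optimizer.lr = lr_at(step)   # assumes the optimizer exposes a mutable `lr` attribute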
Next: Generating text with our GPT.