30  Building GPT

The grand finale. Let’s build GPT from scratch.

30.1 GPT Architecture

flowchart TD
    Tokens[Token IDs] --> TokEmb[Token Embedding]
    Positions[Positions] --> PosEmb[Position Embedding]
    TokEmb --> Add[Add]
    PosEmb --> Add
    Add --> Drop[Dropout]
    Drop --> B1[Transformer Block 1]
    B1 --> B2[Transformer Block 2]
    B2 --> BN[...]
    BN --> BL[Transformer Block L]
    BL --> LN[Layer Norm]
    LN --> Head[LM Head]
    Head --> Logits[Logits]
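
Reading the diagram top to bottom, the tensor shapes are easy to track. A rough trace, using B for batch size, T for sequence length (at most block_size), C for n_embd, and V for vocab_size:

# Shape trace through the model
# idx:      (B, T)       integer token IDs
# tok_emb:  (B, T, C)    token embedding lookup
# pos_emb:  (T, C)       position embedding, broadcast over the batch when added
# x:        (B, T, C)    after add + dropout; shape unchanged by every Transformer block
# x:        (B, T, C)    after the final layer norm
# logits:   (B, T, V)    one next-token distribution per position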

30.2 GPT Configuration

from dataclasses import dataclass

@dataclass
class GPTConfig:
    """GPT model configuration."""
    vocab_size: int = 50257      # GPT-2 vocabulary
    block_size: int = 1024       # Max sequence length
    n_layer: int = 12            # Number of Transformer blocks
    n_head: int = 12             # Number of attention heads
    n_embd: int = 768            # Embedding dimension
    dropout: float = 0.1         # Dropout rate
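
With weight tying (introduced below), the model size is dominated by the embeddings and the Transformer blocks, so a rough estimate is possible before building anything. approx_params is a hypothetical helper, and the 12·C² per-block figure ignores biases and layer-norm parameters:

def approx_params(cfg: GPTConfig) -> int:
    """Back-of-the-envelope parameter count (ignores biases and layer norms)."""
    embeddings = (cfg.vocab_size + cfg.block_size) * cfg.n_embd
    per_block = 12 * cfg.n_embd ** 2      # ~4*C^2 for attention, ~8*C^2 for the 4x-wide FFN
    return embeddings + cfg.n_layer * per_block

print(f"~{approx_params(GPTConfig()) / 1e6:.0f}M parameters")   # ~124M for the defaults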

30.3 The GPT Model

class GPT(Module):
    """GPT Language Model."""

    def __init__(self, config):
        super().__init__()
        self.config = config

        # Embeddings
        self.tok_emb = Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = Embedding(config.block_size, config.n_embd)
        self.drop = Dropout(config.dropout)

        # Transformer blocks
        self.blocks = ModuleList([
            TransformerBlock(
                d_model=config.n_embd,
                num_heads=config.n_head,
                d_ff=4 * config.n_embd,
                dropout=config.dropout
            )
            for _ in range(config.n_layer)
        ])

        # Final layer norm
        self.ln_f = LayerNorm(config.n_embd)

        # Language model head (predicts next token)
        self.lm_head = Linear(config.n_embd, config.vocab_size, bias=False)

        # Weight tying: share embedding and output weights
        self.lm_head.weight = self.tok_emb.weight

        print(f"GPT model with {self.num_parameters():,} parameters")

    def num_parameters(self):
        return sum(p.data.size for p in self.parameters())

Note

Code Reference: See src/tensorweaver/layers/gpt.py for the full implementation.
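
Before moving on to the forward pass, it is worth instantiating the model once with a deliberately tiny configuration to catch wiring mistakes early. A minimal smoke test (the values below are arbitrary):

tiny = GPTConfig(vocab_size=1000, block_size=32, n_layer=2, n_head=2, n_embd=64, dropout=0.0)
model = GPT(tiny)                       # __init__ prints the parameter count
assert model.num_parameters() > 0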

30.4 Forward Pass

class GPT(Module):
    # ... __init__ ...

    def forward(self, idx, targets=None):
        """
        Args:
            idx: Token indices (batch, seq_len)
            targets: Target tokens for loss computation (optional)

        Returns:
            logits: (batch, seq_len, vocab_size)
            loss: Cross-entropy loss (if targets provided)
        """
        batch, seq_len = idx.shape
        assert seq_len <= self.config.block_size

        # Token embeddings
        tok_emb = self.tok_emb(idx)  # (batch, seq, n_embd)

        # Position embeddings
        positions = Tensor(np.arange(seq_len))
        pos_emb = self.pos_emb(positions)  # (seq, n_embd)

        # Combine embeddings
        x = self.drop(tok_emb + pos_emb)

        # Transformer blocks
        for block in self.blocks:
            x = block(x)

        # Final layer norm
        x = self.ln_f(x)

        # Language model head
        logits = self.lm_head(x)  # (batch, seq, vocab_size)

        # Compute loss if targets provided
        loss = None
        if targets is not None:
            loss = cross_entropy_loss(
                logits.reshape(-1, logits.shape[-1]),
                targets.reshape(-1)
            )

        return logits, loss
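
A quick shape check of the forward pass, reusing the tiny model from above with random token IDs (a sketch to verify plumbing, not model quality):

import numpy as np
from tensorweaver import Tensor

idx = Tensor(np.random.randint(0, tiny.vocab_size, size=(2, 16)))        # (batch=2, seq=16)
targets = Tensor(np.random.randint(0, tiny.vocab_size, size=(2, 16)))

logits, loss = model(idx, targets=targets)
print(logits.shape)    # expected: (2, 16, 1000)
print(loss.data)       # a scalar, close to ln(1000) ≈ 6.9 for an untrained model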

30.5 Weight Tying

Share weights between embedding and output:

self.lm_head.weight = self.tok_emb.weight

Why tie the weights?

  - Fewer parameters: the 50257 × 768 matrix is reused rather than duplicated, saving roughly 38M parameters in GPT-2 small
  - Embedding a token and predicting a token are closely related tasks
  - Empirically improves performance
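
Because the tying is a plain attribute assignment, the head and the embedding refer to the same weight object, which is easy to verify on the model built above:

assert model.lm_head.weight is model.tok_emb.weight   # one shared weight, updated once per step

# For GPT-2 small, the reused matrix is 50257 x 768:
print(f"~{50257 * 768 / 1e6:.1f}M parameters saved")   # ~38.6M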

30.6 Training GPT

import numpy as np
import tiktoken
from tensorweaver import Tensor
from tensorweaver.optim import AdamW

# Configuration
config = GPTConfig(
    vocab_size=50257,
    block_size=256,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.1
)

# Create model
model = GPT(config)

# Optimizer (AdamW is standard for Transformers)
optimizer = AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1
)

# Tokenizer
enc = tiktoken.get_encoding("gpt2")

# Prepare data
text = open("shakespeare.txt").read()
tokens = np.array(enc.encode(text), dtype=np.int64)   # the full corpus as token IDs

def get_batch(batch_size, block_size):
    """Sample a random batch of (input, target) sequences."""
    ix = np.random.randint(0, len(tokens) - block_size, batch_size)
    x = np.stack([tokens[i:i+block_size] for i in ix])
    y = np.stack([tokens[i+1:i+block_size+1] for i in ix])   # targets = inputs shifted left by one
    return Tensor(x), Tensor(y)

# Training loop
for step in range(1000):
    # Get batch
    x, y = get_batch(batch_size=32, block_size=config.block_size)

    # Forward
    logits, loss = model(x, targets=y)

    # Backward
    loss.backward()

    # Gradient clipping (defined in Section 30.7)
    clip_grad_norm(model.parameters(), max_norm=1.0)

    # Update
    optimizer.step()
    optimizer.zero_grad()

    if step % 100 == 0:
        print(f"Step {step}: loss={loss.data:.4f}")

30.7 Gradient Clipping

Essential for Transformer training:

def clip_grad_norm(parameters, max_norm):
    """Clip the global gradient norm to prevent exploding gradients."""
    parameters = list(parameters)   # materialize in case a generator is passed

    # Global L2 norm over all parameter gradients
    total_norm = 0.0
    for p in parameters:
        if p.grad is not None:
            total_norm += (p.grad ** 2).sum()
    total_norm = np.sqrt(total_norm)

    # Rescale every gradient when the global norm exceeds the threshold
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)   # small epsilon guards the division
        for p in parameters:
            if p.grad is not None:
                p.grad *= scale

    return total_norm
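
To see the clipping math in isolation, here is a tiny stand-alone check using a hypothetical FakeParam stand-in for real parameter objects:

class FakeParam:
    def __init__(self, grad):
        self.grad = np.array(grad, dtype=np.float64)

params = [FakeParam([3.0]), FakeParam([4.0])]       # global norm = sqrt(3^2 + 4^2) = 5
norm = clip_grad_norm(params, max_norm=1.0)
print(norm)                                          # 5.0 (norm before clipping)
print(params[0].grad, params[1].grad)                # scaled by ~0.2 -> [0.6] and [0.8]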

30.8 Model Sizes

Create different GPT sizes:

def gpt_small():
    return GPT(GPTConfig(n_layer=12, n_head=12, n_embd=768))

def gpt_medium():
    return GPT(GPTConfig(n_layer=24, n_head=16, n_embd=1024))

def gpt_nano():
    """Tiny model for testing."""
    return GPT(GPTConfig(
        n_layer=4,
        n_head=4,
        n_embd=128,
        block_size=64
    ))
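
Using the rough approx_params estimate sketched in Section 30.2 (a hypothetical helper, not part of the library), these presets land close to the published GPT-2 sizes:

for name, cfg in [
    ("nano",   GPTConfig(n_layer=4,  n_head=4,  n_embd=128,  block_size=64)),
    ("small",  GPTConfig()),
    ("medium", GPTConfig(n_layer=24, n_head=16, n_embd=1024)),
]:
    print(f"{name}: ~{approx_params(cfg) / 1e6:.0f}M parameters")
# small comes out near 124M and medium near 355M, matching the GPT-2 checkpoints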

30.9 Summary

GPT consists of:

  1. Token embedding: Tokens → vectors
  2. Position embedding: Positions → vectors
  3. Transformer blocks: Attention + FFN
  4. Layer norm: Final normalization
  5. LM head: Vectors → vocabulary logits

Training requires:

  - AdamW optimizer
  - Gradient clipping
  - Learning rate warmup (in production; a sketch follows below)
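
The warmup mentioned above is straightforward to add. A minimal sketch of linear warmup followed by cosine decay; how the value is handed to AdamW (here an optimizer.lr attribute) is an assumption about the optimizer implementation:

import math

def lr_schedule(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, max_steps=1000):
    """Linear warmup, then cosine decay from max_lr down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Inside the training loop, before optimizer.step():
#     optimizer.lr = lr_schedule(step)    # assumed attribute name; adjust to the actual API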

Next: Generating text with our GPT.