31  Text Generation

Our GPT is trained. Let’s make it talk!

31.1 Autoregressive Generation

GPT generates one token at a time:

Input:  "The cat"
Output: "The cat sat"
         ↑ Generated

Input:  "The cat sat"
Output: "The cat sat on"
                   ↑ Generated

Input:  "The cat sat on"
Output: "The cat sat on the"
                       ↑ Generated

31.2 Basic Generation

def generate(model, idx, max_new_tokens):
    """
    Generate tokens autoregressively.

    Args:
        model: GPT model
        idx: Starting tokens (batch, seq)
        max_new_tokens: Number of tokens to generate

    Returns:
        Extended token sequence
    """
    model.eval()

    for _ in range(max_new_tokens):
        # Crop to block_size if needed
        idx_cond = idx if idx.shape[1] <= model.config.block_size else \
                   idx[:, -model.config.block_size:]

        # Get predictions
        logits, _ = model(idx_cond)

        # Focus on last position
        logits = logits[:, -1, :]  # (batch, vocab_size)

        # Greedy: pick most likely token
        idx_next = logits.data.argmax(axis=-1, keepdims=True)

        # Append to sequence
        idx = Tensor(np.concatenate([idx.data, idx_next], axis=1))

    return idx
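
Here's a quick smoke test of the greedy loop. This is just a sketch: it assumes a trained model and the Tensor class from the previous chapters, and uses tiktoken's GPT-2 encoder for the prompt.

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
prompt_ids = enc.encode("The cat")            # list of GPT-2 token ids
idx = Tensor([prompt_ids])                    # shape (1, seq)

out = generate(model, idx, max_new_tokens=20)
print(enc.decode(out.data[0].astype(int).tolist()))

Because it's greedy, running it twice on the same prompt produces exactly the same continuation.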

31.3 Temperature Sampling

Greedy generation is boring. Add randomness:

def generate_with_temperature(model, idx, max_new_tokens, temperature=1.0):
    """Generate with temperature-controlled randomness."""
    model.eval()

    for _ in range(max_new_tokens):
        idx_cond = idx if idx.shape[1] <= model.config.block_size else \
                   idx[:, -model.config.block_size:]

        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # Scale by temperature

        # Convert to probabilities
        probs = softmax(logits, axis=-1).data

        # Sample from distribution
        idx_next = np.array([[np.random.choice(len(probs[0]), p=probs[0])]])

        idx = Tensor(np.concatenate([idx.data, idx_next], axis=1))

    return idx

Temperature effects:

Temperature   Effect
→ 0           Approaches greedy (use the greedy loop above; dividing by 0 is undefined)
0.5           Focused but varied
1.0           Balanced (the model's raw distribution)
1.5           Creative, sometimes nonsense
2.0+          Random
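
To see what dividing by the temperature actually does, here is a tiny standalone demo on made-up logits (softmax_np is just a local helper for this example):

import numpy as np

def softmax_np(x):
    e = np.exp(x - x.max())        # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
for t in [0.5, 1.0, 2.0]:
    print(t, softmax_np(logits / t).round(2))

# t=0.5 -> ~[0.86, 0.12, 0.02]   sharper: the top token dominates
# t=1.0 -> ~[0.66, 0.24, 0.10]   the model's raw distribution
# t=2.0 -> ~[0.50, 0.30, 0.19]   flatter: unlikely tokens get more chances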

31.4 Top-K Sampling

Sample only from the K most likely tokens:

def generate_top_k(model, idx, max_new_tokens, temperature=1.0, top_k=50):
    """Generate with top-k sampling."""
    model.eval()

    for _ in range(max_new_tokens):
        idx_cond = idx if idx.shape[1] <= model.config.block_size else \
                   idx[:, -model.config.block_size:]

        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature

        # Keep only the top K: mask logits below the K-th largest to -inf
        # (threshold taken from batch row 0; generation uses batch size 1)
        if top_k is not None:
            kth_value = np.sort(logits.data[0])[-top_k]
            logits.data[logits.data < kth_value] = -np.inf

        probs = softmax(logits, axis=-1).data
        idx_next = np.array([[np.random.choice(len(probs[0]), p=probs[0])]])

        idx = Tensor(np.concatenate([idx.data, idx_next], axis=1))

    return idx
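
The filtering step is easiest to see on a toy logit vector; the values below are made up:

import numpy as np

logits = np.array([3.0, 1.0, 2.5, -1.0, 0.5])
top_k = 2

kth_value = np.sort(logits)[-top_k]              # 2.5, the 2nd-largest logit
filtered = np.where(logits < kth_value, -np.inf, logits)
print(filtered)                                  # [ 3. -inf  2.5 -inf -inf]

After softmax, only the two surviving tokens have non-zero probability; everything else is impossible no matter the temperature.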

31.5 Top-P (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability is at least p:

def generate_top_p(model, idx, max_new_tokens, temperature=1.0, top_p=0.9):
    """Generate with nucleus (top-p) sampling."""
    model.eval()

    for _ in range(max_new_tokens):
        idx_cond = idx if idx.shape[1] <= model.config.block_size else \
                   idx[:, -model.config.block_size:]

        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature

        # Sort probabilities
        probs = softmax(logits, axis=-1).data[0]
        sorted_indices = np.argsort(probs)[::-1]
        sorted_probs = probs[sorted_indices]

        # Find cutoff
        cumsum = np.cumsum(sorted_probs)
        cutoff_idx = np.searchsorted(cumsum, top_p) + 1

        # Zero out low probability tokens
        mask = np.zeros_like(probs)
        mask[sorted_indices[:cutoff_idx]] = 1
        probs = probs * mask
        probs = probs / probs.sum()  # Renormalize

        idx_next = np.array([[np.random.choice(len(probs), p=probs)]])
        idx = Tensor(np.concatenate([idx.data, idx_next], axis=1))

    return idx
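
A toy example makes the cutoff concrete (the probabilities below are made up and already sum to 1):

import numpy as np

probs = np.array([0.1, 0.5, 0.05, 0.3, 0.05])
top_p = 0.75

sorted_indices = np.argsort(probs)[::-1]         # most likely first: [1, 3, 0, ...]
cumsum = np.cumsum(probs[sorted_indices])        # [0.5, 0.8, 0.9, 0.95, 1.0]
cutoff_idx = np.searchsorted(cumsum, top_p) + 1  # 2: tokens 1 and 3 reach 0.8 >= 0.75
print(sorted_indices[:cutoff_idx])               # the "nucleus": [1 3]

Unlike top-k, the nucleus size adapts to the distribution: a confident model keeps only a couple of tokens, an uncertain one keeps many.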

31.6 Complete Generation Function

def generate_text(model, prompt, max_tokens=100,
                  temperature=0.8, top_k=50, top_p=0.9):
    """
    Generate text from a prompt.

    Args:
        model: Trained GPT model
        prompt: Text prompt
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature
        top_k: Top-k filtering
        top_p: Nucleus sampling threshold
    """
    enc = tiktoken.get_encoding("gpt2")

    # Encode prompt
    tokens = enc.encode(prompt)
    idx = Tensor([tokens])

    # Generate
    model.eval()
    for _ in range(max_tokens):
        idx_cond = idx if idx.shape[1] <= model.config.block_size else \
                   idx[:, -model.config.block_size:]

        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature

        # Top-k filtering: mask logits below the k-th largest (batch size 1)
        if top_k:
            kth_value = np.sort(logits.data[0])[-top_k]
            logits.data[logits.data < kth_value] = -np.inf

        probs = softmax(logits, axis=-1).data[0]

        # Top-p filtering
        if top_p < 1.0:
            sorted_idx = np.argsort(probs)[::-1]
            cumsum = np.cumsum(probs[sorted_idx])
            cutoff = np.searchsorted(cumsum, top_p) + 1
            mask = np.zeros_like(probs)
            mask[sorted_idx[:cutoff]] = 1
            probs = probs * mask
            probs = probs / probs.sum()

        # Sample
        next_token = np.random.choice(len(probs), p=probs)
        idx = Tensor(np.concatenate([idx.data, [[next_token]]], axis=1))

        # Stop at end-of-text token
        if next_token == enc.eot_token:
            break

    # Decode
    return enc.decode(idx.data[0].astype(int).tolist())

31.7 Example Usage

# Load trained model
model = GPT(GPTConfig())
model.load_state_dict(np.load("gpt_shakespeare.npz"))

# Generate text
prompt = "To be or not to be"
generated = generate_text(
    model,
    prompt,
    max_tokens=100,
    temperature=0.8,
    top_k=50
)

print(generated)

Example output (Shakespeare-trained):

To be or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles...

31.8 Part VIII Complete!

Tip

The Grand Finale! You’ve built GPT from scratch!

  • ✓ Embeddings (token + position)
  • ✓ Attention mechanism
  • ✓ Multi-head attention
  • ✓ Transformer blocks
  • ✓ Complete GPT model
  • ✓ Text generation

You understand every layer, every gradient, every token.

31.9 Summary

Text generation strategies:

Method        Description                   Use Case
Greedy        Pick the highest-prob token   Fast, deterministic
Temperature   Scale the logits              Control randomness
Top-K         Sample from the top K         Balance quality/diversity
Top-P         Sample from the nucleus       Adaptive filtering

Congratulations! You’ve built a complete deep learning framework and trained GPT. The principles you’ve learned apply to all modern AI systems.