31 Text Generation
Our GPT is trained. Let’s make it talk!
31.1 Autoregressive Generation
GPT generates one token at a time:
Input: "The cat"
Output: "The cat sat"
↑ Generated
Input: "The cat sat"
Output: "The cat sat on"
↑ Generated
Input: "The cat sat on"
Output: "The cat sat on the"
↑ Generated
31.2 Basic Generation
import numpy as np
# Tensor comes from the framework we built in earlier chapters

def generate(model, idx, max_new_tokens):
    """
    Generate tokens autoregressively.

    Args:
        model: GPT model
        idx: Starting tokens (batch, seq)
        max_new_tokens: Number of tokens to generate

    Returns:
        Extended token sequence
    """
    model.eval()
    for _ in range(max_new_tokens):
        # Crop to block_size if needed
        idx_cond = idx if idx.shape[1] <= model.config.block_size else \
                   idx[:, -model.config.block_size:]

        # Get predictions
        logits, _ = model(idx_cond)

        # Focus on last position
        logits = logits[:, -1, :]  # (batch, vocab_size)

        # Greedy: pick the most likely token
        idx_next = logits.data.argmax(axis=-1, keepdims=True)

        # Append to sequence
        idx = Tensor(np.concatenate([idx.data, idx_next], axis=1))

    return idx
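For example, assuming `model` is the GPT trained in the previous chapter and using tiktoken's GPT-2 encoding for the prompt, a greedy call looks like this (a sketch, not part of the framework):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
idx = Tensor([enc.encode("The cat")])   # shape (1, num_prompt_tokens)

out = generate(model, idx, max_new_tokens=3)
print(enc.decode(out.data[0].astype(int).tolist()))
# e.g. "The cat sat on the", given enough training on English text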
31.3 Temperature Sampling
Greedy generation is boring: every run produces exactly the same text. Temperature adds controlled randomness:
def generate_with_temperature(model, idx, max_new_tokens, temperature=1.0):
    """Generate with temperature-controlled randomness (temperature > 0)."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.shape[1] <= model.config.block_size else \
                   idx[:, -model.config.block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # Scale by temperature

        # Convert to probabilities
        probs = softmax(logits, axis=-1).data

        # Sample from the distribution (batch size 1 assumed)
        idx_next = np.array([[np.random.choice(len(probs[0]), p=probs[0])]])
        idx = Tensor(np.concatenate([idx.data, idx_next], axis=1))
    return idx

Temperature effects:
| Temperature | Effect |
|---|---|
| → 0 | Approaches greedy (deterministic); special-case with argmax rather than dividing by zero |
| 0.5 | Focused but varied |
| 1.0 | Balanced (the model's raw distribution) |
| 1.5 | Creative, sometimes nonsense |
| 2.0+ | Close to random |
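Why this works: dividing the logits by T before the softmax sharpens the distribution when T < 1 and flattens it when T > 1. A standalone NumPy sketch with toy logits makes this visible:

import numpy as np

def softmax_np(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([3.0, 2.0, 1.0, 0.5])
for t in [0.5, 1.0, 2.0]:
    print(f"T={t}:", np.round(softmax_np(logits / t), 3))
# T=0.5 piles mass onto the top token; T=2.0 spreads it toward uniform.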
31.4 Top-K Sampling
Sample only from the K most likely tokens:
def generate_top_k(model, idx, max_new_tokens, temperature=1.0, top_k=50):
    """Generate with top-k sampling."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.shape[1] <= model.config.block_size else \
                   idx[:, -model.config.block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature

        # Keep only the top K: find the K-th largest logit and
        # mask everything below it to -inf (probability 0 after softmax)
        if top_k is not None:
            kth_largest = np.sort(logits.data[0])[-top_k]
            logits.data[logits.data < kth_largest] = -np.inf

        probs = softmax(logits, axis=-1).data
        idx_next = np.array([[np.random.choice(len(probs[0]), p=probs[0])]])
        idx = Tensor(np.concatenate([idx.data, idx_next], axis=1))
    return idx
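The threshold trick (sort, take the K-th largest value, mask everything below it) is easy to verify in isolation with toy logits:

import numpy as np

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
top_k = 3

kth_largest = np.sort(logits)[-top_k]   # 0.5, the 3rd-largest logit
filtered = logits.copy()
filtered[filtered < kth_largest] = -np.inf
print(filtered)                         # [ 2.   1.   0.5 -inf -inf]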
31.5 Top-P (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability is at least p:
def generate_top_p(model, idx, max_new_tokens, temperature=1.0, top_p=0.9):
    """Generate with nucleus (top-p) sampling."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.shape[1] <= model.config.block_size else \
                   idx[:, -model.config.block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature

        # Sort probabilities in descending order
        probs = softmax(logits, axis=-1).data[0]
        sorted_indices = np.argsort(probs)[::-1]
        sorted_probs = probs[sorted_indices]

        # Find the smallest prefix whose cumulative probability >= top_p
        cumsum = np.cumsum(sorted_probs)
        cutoff_idx = np.searchsorted(cumsum, top_p) + 1

        # Zero out tokens outside the nucleus
        mask = np.zeros_like(probs)
        mask[sorted_indices[:cutoff_idx]] = 1
        probs = probs * mask
        probs = probs / probs.sum()  # Renormalize

        idx_next = np.array([[np.random.choice(len(probs), p=probs)]])
        idx = Tensor(np.concatenate([idx.data, idx_next], axis=1))
    return idx
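The searchsorted cutoff deserves a second look. With a toy, already-sorted distribution:

import numpy as np

probs = np.array([0.6, 0.25, 0.1, 0.05])   # sorted, descending
cumsum = np.cumsum(probs)                  # [0.6, 0.85, 0.95, 1.0]
cutoff = np.searchsorted(cumsum, 0.9) + 1
print(cutoff)   # 3: nucleus = first 3 tokens (cumulative 0.95 >= 0.9)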
31.6 Complete Generation Function
Everything combined: temperature scaling, top-k, and top-p filtering in one function:

import tiktoken

def generate_text(model, prompt, max_tokens=100,
                  temperature=0.8, top_k=50, top_p=0.9):
    """
    Generate text from a prompt.

    Args:
        model: Trained GPT model
        prompt: Text prompt
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature
        top_k: Top-k filtering (None or 0 disables it)
        top_p: Nucleus sampling threshold (1.0 disables it)

    Returns:
        The decoded text, prompt included.
    """
    enc = tiktoken.get_encoding("gpt2")

    # Encode prompt
    tokens = enc.encode(prompt)
    idx = Tensor([tokens])

    # Generate
    model.eval()
    for _ in range(max_tokens):
        idx_cond = idx if idx.shape[1] <= model.config.block_size else \
                   idx[:, -model.config.block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature

        # Top-k filtering: mask logits below the k-th largest
        if top_k:
            kth_largest = np.sort(logits.data[0])[-top_k]
            logits.data[logits.data < kth_largest] = -np.inf

        probs = softmax(logits, axis=-1).data[0]

        # Top-p filtering: keep the nucleus, renormalize
        if top_p < 1.0:
            sorted_idx = np.argsort(probs)[::-1]
            cumsum = np.cumsum(probs[sorted_idx])
            cutoff = np.searchsorted(cumsum, top_p) + 1
            mask = np.zeros_like(probs)
            mask[sorted_idx[:cutoff]] = 1
            probs = probs * mask
            probs = probs / probs.sum()

        # Sample
        next_token = np.random.choice(len(probs), p=probs)
        idx = Tensor(np.concatenate([idx.data, [[next_token]]], axis=1))

        # Stop at the end-of-text token
        if next_token == enc.eot_token:
            break

    # Decode
    return enc.decode(idx.data[0].astype(int).tolist())
31.7 Example Usage
# Load trained model
model = GPT(GPTConfig())
model.load_state_dict(np.load("gpt_shakespeare.npz"))

# Generate text
prompt = "To be or not to be"
generated = generate_text(
    model,
    prompt,
    max_tokens=100,
    temperature=0.8,
    top_k=50
)
print(generated)

Example output (Shakespeare-trained):
To be or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles...
31.8 Part VIII Complete!
Tip
The Grand Finale! You’ve built GPT from scratch!
- ✓ Embeddings (token + position)
- ✓ Attention mechanism
- ✓ Multi-head attention
- ✓ Transformer blocks
- ✓ Complete GPT model
- ✓ Text generation
You understand every layer, every gradient, every token.
31.9 Summary
Text generation strategies:
| Method | Description | Use Case |
|---|---|---|
| Greedy | Pick the highest-probability token | Fast, deterministic |
| Temperature | Scale logits before softmax | Control randomness |
| Top-K | Sample from the K most likely tokens | Balance quality/diversity |
| Top-P | Sample from the smallest set with cumulative probability ≥ p | Adaptive filtering |
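All four strategies are just different rules for turning the same logits into a token choice. A final standalone sketch (NumPy only, toy logits) shows each decision side by side:

import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.5, 1.2, 0.8, 0.1, -1.0])

def softmax_np(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Greedy: always the argmax
print("greedy:", np.argmax(logits))

# Temperature: sample from softmax(logits / T)
probs = softmax_np(logits / 0.8)
print("temperature:", rng.choice(len(probs), p=probs))

# Top-k: mask everything below the k-th largest logit, then sample
k = 3
masked = np.where(logits >= np.sort(logits)[-k], logits, -np.inf)
probs = softmax_np(masked)
print("top-k:", rng.choice(len(probs), p=probs))

# Top-p: keep the smallest prefix with cumulative probability >= p
p, probs = 0.9, softmax_np(logits)
order = np.argsort(probs)[::-1]
cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
nucleus = np.zeros_like(probs)
nucleus[order[:cutoff]] = probs[order[:cutoff]]
print("top-p:", rng.choice(len(nucleus), p=nucleus / nucleus.sum()))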
Congratulations! You’ve built a complete deep learning framework and trained GPT. The principles you’ve learned apply to all modern AI systems.