27 The Attention Mechanism

```mermaid
flowchart LR
    X[Input] --> Q[Query]
    X --> K[Key]
    X --> V[Value]
    Q --> Attn[Attention Scores]
    K --> Attn
    Attn --> Out[Weighted Sum]
    V --> Out
```
Attention is the core of Transformers. Let’s understand it deeply.
27.1 The Intuition
Consider: “The cat sat on the mat because it was tired.”
What does “it” refer to? The cat.
Attention lets each word “look at” all other words to gather context.
27.2 Query, Key, Value
Attention uses three learned projections of the input (a short code sketch follows this list):
- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What do I return?
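To make the three roles concrete, here is a minimal NumPy sketch. The dimensions and weights are made up purely for illustration; in a real layer the projection matrices are learned parameters.

```python
import numpy as np

d_model, seq_len = 8, 4
X = np.random.randn(seq_len, d_model)    # one embedding per token (illustrative values)

# Three projection matrices (random here; learned parameters in a real layer)
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # what each token offers for matching
V = X @ W_v   # what each token contributes to the output
```

All three are linear maps of the same input; they differ only in their weights.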
27.3 The Attention Formula
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]
Step by step:
- Compute scores: \(QK^T\) (how similar is each query to each key)
- Scale: Divide by \(\sqrt{d_k}\) (prevents softmax saturation)
- Softmax: Convert to probabilities
- Weighted sum: Multiply by values
27.4 Implementation
```python
import numpy as np

from tensorweaver import Tensor
from tensorweaver.nn.functional import softmax


def attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention.

    Args:
        Q: Queries tensor (batch, seq, d_k)
        K: Keys tensor (batch, seq, d_k)
        V: Values tensor (batch, seq, d_v)
        mask: Optional attention mask tensor

    Returns:
        Attention output tensor (batch, seq, d_v)
    """
    d_k = Q.shape[-1]

    # Compute attention scores: Q @ K^T
    scores = Q @ K.transpose(-2, -1)

    # Scale by sqrt(d_k) to prevent softmax saturation
    scores = scores / np.sqrt(d_k)
    # scores: (batch, seq, seq)

    # Apply mask (for causal attention)
    if mask is not None:
        scores = scores + mask * (-1e9)

    # Softmax over keys (last dimension)
    weights = softmax(scores, axis=-1)
    # weights: (batch, seq, seq) - each row sums to 1

    # Weighted sum of values
    output = weights @ V
    # output: (batch, seq, d_v)

    return output
```

All inputs and outputs are Tensor objects, maintaining the computational graph for backpropagation.
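A quick shape check, as a minimal sketch using the imports from the listing above. The batch, sequence, and model sizes are arbitrary; wrapping a NumPy array with `Tensor(...)` follows the constructor used elsewhere in this chapter, and `.shape` is assumed available since `attention()` itself uses it.

```python
batch, seq, d_k = 2, 5, 16

Q = Tensor(np.random.randn(batch, seq, d_k))
K = Tensor(np.random.randn(batch, seq, d_k))
V = Tensor(np.random.randn(batch, seq, d_k))

out = attention(Q, K, V)
print(out.shape)   # (2, 5, 16): one d_k-dimensional output per position
```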
27.5 Self-Attention
When Q, K, V all come from the same input:
```python
class SelfAttention(Module):
    """Single-head self-attention."""

    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.W_q = Linear(d_model, d_model, bias=False)
        self.W_k = Linear(d_model, d_model, bias=False)
        self.W_v = Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        """
        Args:
            x: Input (batch, seq, d_model)
            mask: Attention mask

        Returns:
            Output (batch, seq, d_model)
        """
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        return attention(Q, K, V, mask)
```

27.6 Causal Masking
For autoregressive language models, a position must not see future tokens:
```python
def create_causal_mask(seq_len):
    """
    Create causal attention mask.

    Position i can only attend to positions <= i.
    """
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return Tensor(mask)
```

Visualization:
```
        pos0  pos1  pos2  pos3
pos0 [    0    -∞    -∞    -∞ ]  ← can only see itself
pos1 [    0     0    -∞    -∞ ]  ← can see pos0, pos1
pos2 [    0     0     0    -∞ ]  ← can see pos0-pos2
pos3 [    0     0     0     0 ]  ← can see all positions
```
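Putting the mask and the attention function together, here is a minimal sketch. It assumes the (seq, seq) mask broadcasts across the batch dimension when it is added to the (batch, seq, seq) scores inside `attention()`.

```python
seq_len, d_k = 4, 8

Q = Tensor(np.random.randn(1, seq_len, d_k))
K = Tensor(np.random.randn(1, seq_len, d_k))
V = Tensor(np.random.randn(1, seq_len, d_k))

mask = create_causal_mask(seq_len)    # 1s above the diagonal, 0s elsewhere
out = attention(Q, K, V, mask=mask)   # each position ignores all later positions
```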
27.7 Attention Weights Visualization
For the toy input “The cat sat”, the weights might look like:
```
        The   cat   sat
The  [  0.8   0.1   0.1 ]
cat  [  0.3   0.5   0.2 ]
sat  [  0.2   0.4   0.4 ]
```
Each row sums to 1 (softmax). Values show how much each position attends to others.
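The numbers above are illustrative. The following minimal NumPy-only sketch (no learned projections; the token embeddings serve directly as queries, keys, and values) shows how such a weight matrix is produced and that its rows sum to 1:

```python
np.random.seed(0)
X = np.random.randn(3, 4)                        # 3 tokens, d_k = 4 (made-up embeddings)

scores = X @ X.T / np.sqrt(4)                    # similarity of every token to every token
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

print(weights.sum(axis=-1))                      # [1. 1. 1.]
```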
27.8 Why Scale by √d_k?
Without scaling:
```python
d_k = 64
Q = np.random.randn(1, 10, d_k)
K = np.random.randn(1, 10, d_k)

scores = Q @ K.transpose(0, 2, 1)    # dot products have variance ~ d_k
# With d_k = 64 the scores have standard deviation ~ 8,
# so the largest score dominates and softmax saturates!
probs = softmax(scores, axis=-1)     # one weight ~1, the rest ~0
```

With scaling:

```python
scores = (Q @ K.transpose(0, 2, 1)) / np.sqrt(d_k)   # scores now have variance ~ 1
probs = softmax(scores, axis=-1)                     # smoother distribution
```

27.9 Complete Causal Self-Attention
```python
class CausalSelfAttention(Module):
    """Causal self-attention for language models."""

    def __init__(self, d_model, max_seq_len=1024):
        super().__init__()
        self.d_model = d_model
        self.W_q = Linear(d_model, d_model, bias=False)
        self.W_k = Linear(d_model, d_model, bias=False)
        self.W_v = Linear(d_model, d_model, bias=False)

        # Precompute causal mask (1s above the diagonal)
        mask = np.triu(np.ones((max_seq_len, max_seq_len)), k=1)
        self.register_buffer('mask', mask)

    def forward(self, x):
        batch, seq_len, _ = x.shape

        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)

        # Attention with causal mask
        scores = (Q @ K.transpose(-2, -1)) / np.sqrt(self.d_model)
        scores = scores + self.mask[:seq_len, :seq_len] * (-1e9)
        weights = softmax(scores, axis=-1)
        output = weights @ V
        return output
```

Code Reference: See src/tensorweaver/layers/causal_self_attention.py
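A quick usage sketch. It assumes that calling a `Module` instance dispatches to `forward`, as in PyTorch-style frameworks; if TensorWeaver does not do this, call `layer.forward(x)` directly.

```python
layer = CausalSelfAttention(d_model=32)

x = Tensor(np.random.randn(2, 10, 32))   # (batch, seq, d_model)
y = layer(x)                             # position i only uses positions <= i
print(y.shape)                           # (2, 10, 32)
```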
27.10 Summary
- Attention computes weighted sums based on similarity
- Q, K, V: Query = what to look for, Key = what a position contains, Value = what it returns
- Scaling: Divide by √d_k for stable softmax
- Causal mask: Prevent looking at future tokens
- Self-attention has O(n²) complexity with sequence length
Next: Multi-head attention for richer representations.