26 Embedding Layers
GPT starts with embeddings — converting tokens to vectors.
26.1 What is an Embedding?
Tokens are integers. Neural networks need vectors.
Token: "hello" → 1234
Embedding: 1234 → [0.2, -0.5, 0.8, ..., 0.1] (768 dims)
An embedding layer is just a lookup table:
class Embedding(Module):
    """Lookup table for token embeddings."""

    def __init__(self, num_embeddings, embedding_dim):
        """
        Args:
            num_embeddings: Vocabulary size
            embedding_dim: Dimension of each embedding
        """
        super().__init__()
        # Initialize with small random values
        self.weight = Tensor(
            np.random.randn(num_embeddings, embedding_dim) * 0.02,
            requires_grad=True
        )

    def forward(self, indices):
        """
        Look up embeddings for token indices.

        Args:
            indices: Integer tensor of shape (batch, seq_len)

        Returns:
            Tensor of shape (batch, seq_len, embedding_dim)
        """
        return Tensor(self.weight.data[indices.data.astype(int)])
Note
Code Reference: See src/tensorweaver/layers/embedding.py for the implementation.
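Another way to see the lookup: indexing row i of the weight matrix returns exactly the same vector as multiplying a one-hot encoding of i by that matrix. A minimal NumPy sketch with toy sizes and random placeholder weights:

import numpy as np

vocab_size, embedding_dim = 10, 4
weight = np.random.randn(vocab_size, embedding_dim) * 0.02

token_id = 7

# View 1: direct row lookup
looked_up = weight[token_id]

# View 2: one-hot vector times the weight matrix
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
multiplied = one_hot @ weight

assert np.allclose(looked_up, multiplied)

The lookup form is what frameworks actually use: materializing one-hot vectors over a 50k-token vocabulary would waste memory and compute for no benefit.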
26.2 Token Embedding
Each token in the vocabulary gets a vector:
# GPT-2 has 50257 tokens, 768 dimensions
token_embedding = Embedding(50257, 768)
# Example: encode "Hello world"
tokens = Tensor([[15496, 995]]) # Token IDs
embeddings = token_embedding(tokens)
print(embeddings.shape)  # (1, 2, 768)
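It is worth pausing on how big this table is; the arithmetic is simple:

vocab_size, embedding_dim = 50257, 768
params = vocab_size * embedding_dim
print(f"{params:,} parameters")                  # 38,597,376
print(f"{params * 4 / 1e6:.1f} MB in float32")   # ~154.4 MB

At roughly 38.6M parameters, the token embedding table alone accounts for about a third of GPT-2 small's ~124M parameters.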
26.3 Position Embedding
Transformers have no built-in sense of token order, so we add position information:
class PositionEmbedding(Module):
    """Learnable position embeddings."""

    def __init__(self, max_seq_len, embedding_dim):
        super().__init__()
        self.weight = Tensor(
            np.random.randn(max_seq_len, embedding_dim) * 0.02,
            requires_grad=True
        )

    def forward(self, seq_len):
        """Return position embeddings for the first seq_len positions."""
        return Tensor(self.weight.data[:seq_len])
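A quick usage sketch, assuming the PositionEmbedding class above and the usual module-call convention (calling the module runs forward):

pos_emb = PositionEmbedding(max_seq_len=1024, embedding_dim=768)

# Only the first seq_len rows of the table are needed
out = pos_emb(10)
print(out.shape)  # (10, 768)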
26.4 Combined Embeddings
GPT combines token and position embeddings:
class GPTEmbedding(Module):
    """Token + Position embeddings for GPT."""

    def __init__(self, vocab_size, max_seq_len, embedding_dim):
        super().__init__()
        self.token_embedding = Embedding(vocab_size, embedding_dim)
        self.position_embedding = Embedding(max_seq_len, embedding_dim)

    def forward(self, tokens):
        """
        Args:
            tokens: (batch, seq_len) token indices

        Returns:
            (batch, seq_len, embedding_dim) embeddings
        """
        batch_size, seq_len = tokens.shape

        # Token embeddings
        tok_emb = self.token_embedding(tokens)

        # Position embeddings
        positions = Tensor(np.arange(seq_len))
        pos_emb = self.position_embedding(positions)

        # Combine
        return tok_emb + pos_emb
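Note the shapes here: tok_emb is (batch, seq_len, embedding_dim) while pos_emb is (seq_len, embedding_dim), so the addition relies on broadcasting to reuse the same position vectors for every example in the batch. A plain NumPy sketch of that shape rule:

import numpy as np

batch, seq_len, dim = 2, 5, 8
tok_emb = np.random.randn(batch, seq_len, dim)
pos_emb = np.random.randn(seq_len, dim)

combined = tok_emb + pos_emb   # pos_emb broadcasts across the batch axis
print(combined.shape)          # (2, 5, 8)

# Every example in the batch receives the same position offsets
assert np.allclose(combined[0] - tok_emb[0], combined[1] - tok_emb[1])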
26.5 Embedding Backward
The gradient of an embedding lookup accumulates into the rows that were looked up:
def embedding_backward(grad_output, indices, vocab_size):
    """
    Compute gradient for embedding layer.

    Gradient accumulates at the looked-up indices.
    """
    grad_weight = np.zeros((vocab_size, grad_output.shape[-1]))

    # Accumulate gradients for each token
    flat_grad = grad_output.reshape(-1, grad_output.shape[-1])
    for i, idx in enumerate(indices.flatten()):
        grad_weight[int(idx)] += flat_grad[i]

    return grad_weight
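The Python loop above is easy to read but slow for long sequences. If the indices are already a NumPy array, the same accumulation can be written in one call with np.add.at, which (unlike plain fancy-index assignment) adds correctly when a token ID appears more than once. A sketch, kept separate so the two versions can be compared:

import numpy as np

def embedding_backward_vectorized(grad_output, indices, vocab_size):
    """Same result as the loop version, using np.add.at for the accumulation."""
    embedding_dim = grad_output.shape[-1]
    grad_weight = np.zeros((vocab_size, embedding_dim))
    flat_idx = indices.reshape(-1).astype(int)
    flat_grad = grad_output.reshape(-1, embedding_dim)
    np.add.at(grad_weight, flat_idx, flat_grad)  # unbuffered add: repeated IDs accumulate
    return grad_weight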
26.6 Tokenization (Brief Overview)
GPT uses Byte Pair Encoding (BPE):
import tiktoken
# GPT-2 tokenizer
enc = tiktoken.get_encoding("gpt2")
# Encode text to tokens
text = "Hello, world!"
tokens = enc.encode(text)
print(tokens) # [15496, 11, 995, 0]
# Decode tokens to text
decoded = enc.decode(tokens)
print(decoded)  # "Hello, world!"

We won't implement a tokenizer; use tiktoken or Hugging Face tokenizers instead.
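BPE keeps common words as single tokens and splits rarer words into subword pieces. A quick way to see this for yourself (the exact IDs and splits depend on the GPT-2 vocabulary, so none are hard-coded here):

import tiktoken

enc = tiktoken.get_encoding("gpt2")

for word in ["the", "hello", "unbelievability"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")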
26.7 Complete Embedding Example
import tiktoken
import numpy as np
from tensorweaver import Tensor
from tensorweaver.nn import Module, Embedding
# Tokenizer
enc = tiktoken.get_encoding("gpt2")
# Embedding layer
embedding = GPTEmbedding(
    vocab_size=50257,
    max_seq_len=1024,
    embedding_dim=768
)
# Tokenize text
text = "The quick brown fox"
tokens = enc.encode(text)
print(f"Tokens: {tokens}")
# Get embeddings
token_tensor = Tensor([tokens]) # Add batch dimension
embeddings = embedding(token_tensor)
print(f"Embeddings shape: {embeddings.shape}")
# (1, 4, 768)
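One last sanity check ties the two embedding tables together: the same token ID placed at two different positions should produce different combined vectors, because the position term differs. A sketch continuing the example above, using an arbitrary token ID (42 is just a placeholder) and assuming the result exposes its values via .data like the other Tensors here:

# Same (arbitrary) token ID at positions 0 and 1
repeated = Tensor([[42, 42]])
out = embedding(repeated)

# The token lookups are identical; the position embeddings make the sums differ
assert not np.allclose(out.data[0, 0], out.data[0, 1])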
26.8 Summary
- Embedding = lookup table (indices → vectors)
- Token embedding: each token in the vocabulary gets a vector
- Position embedding: Each position gets a vector
- Embeddings are learned during training
- Use tiktoken for tokenization
Next: The attention mechanism.