19 Dataset and DataLoader
Clean data handling for training. Essential for larger datasets.
19.1 The Problem
Our current approach:
```python
# Load all data into tensors
X_train = Tensor(all_data)
y_train = Tensor(all_labels)

# Train on the entire dataset each epoch
for epoch in range(epochs):
    logits = model(X_train)  # Entire dataset at once!
    loss = loss_fn(logits, y_train)
```

Problems:

- Large datasets don't fit in memory
- No shuffling (the model sees the same order every epoch)
- No control over batching
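To make the memory problem concrete, here is a rough back-of-envelope calculation with hypothetical numbers (one million small RGB images stored as float32):

```python
# Hypothetical back-of-envelope: how big is "all data in one tensor"?
n_samples = 1_000_000                     # one million images (made-up figure)
bytes_per_sample = 224 * 224 * 3 * 4      # 224x224 RGB, float32 = 4 bytes per value
total_gb = n_samples * bytes_per_sample / 1e9
print(f"~{total_gb:.0f} GB just to hold the inputs")  # ~602 GB
```

Batched loading sidesteps this by materializing only one small slice of the data at a time.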
19.2 The Dataset Class
Abstract interface for data access:
```python
class Dataset:
    """Abstract base class for datasets."""

    def __len__(self):
        """Return number of samples."""
        raise NotImplementedError

    def __getitem__(self, idx):
        """Return sample at index."""
        raise NotImplementedError
```

19.3 Implementing a Dataset
```python
class IrisDataset(Dataset):
    """Iris flower dataset."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# Usage
from sklearn.datasets import load_iris

iris = load_iris()
dataset = IrisDataset(iris.data, iris.target)

print(f"Dataset size: {len(dataset)}")
x, y = dataset[0]
print(f"Sample: features={x}, label={y}")
```

19.4 The DataLoader Class
Handles batching, shuffling, and iteration:
```python
import numpy as np


class DataLoader:
    """Iterates over a dataset in batches."""

    def __init__(self, dataset, batch_size=32, shuffle=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __len__(self):
        """Number of batches."""
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        """Iterate over batches."""
        n = len(self.dataset)
        indices = np.arange(n)
        if self.shuffle:
            np.random.shuffle(indices)

        for start in range(0, n, self.batch_size):
            end = min(start + self.batch_size, n)
            batch_indices = indices[start:end]

            # Collect the samples for this batch
            batch_x = []
            batch_y = []
            for idx in batch_indices:
                x, y = self.dataset[idx]
                batch_x.append(x)
                batch_y.append(y)

            # Stack into arrays and wrap as tensors
            yield Tensor(np.array(batch_x)), Tensor(np.array(batch_y))
```
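As a quick sanity check before using it for training, iterating the loader over the Iris arrays loaded in the previous section should give ceil(150 / 32) = 5 batches, with the last one smaller than the rest:

```python
# Quick check, reusing the iris arrays loaded in the previous section
loader = DataLoader(IrisDataset(iris.data, iris.target), batch_size=32, shuffle=True)

print(len(loader))                                 # 5 batches: ceil(150 / 32)
for batch_x, batch_y in loader:
    print(batch_x.data.shape, batch_y.data.shape)  # (32, 4) (32,) ... last batch (22, 4) (22,)
```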
19.5 Using DataLoader

```python
# Create dataset and loader
dataset = IrisDataset(X_train, y_train)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Training loop with batches
for epoch in range(epochs):
    for batch_x, batch_y in loader:
        logits = model(batch_x)
        loss = cross_entropy(logits, batch_y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

19.6 Why Batching Matters
| Approach | Memory | Gradient Quality | Speed |
|---|---|---|---|
| Full batch | High | Best | Slow per update |
| Mini-batch | Medium | Good | Fast |
| Single sample | Low | Noisy | Slow overall |
Mini-batches of 16–128 samples are the sweet spot for most problems.
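The trade-off in the speed column comes down to how many optimizer steps each strategy gets per pass over the data. A small sketch, using the same ceiling division as `DataLoader.__len__` and a hypothetical 1,000-sample dataset:

```python
# Updates per epoch for different batch sizes (hypothetical 1,000-sample dataset)
n_samples = 1000

for batch_size in (n_samples, 64, 1):  # full batch, mini-batch, single sample
    n_updates = (n_samples + batch_size - 1) // batch_size  # same math as DataLoader.__len__
    print(f"batch_size={batch_size:>5} -> {n_updates:>4} optimizer steps per epoch")
```

Full batch gives one precise but expensive update per epoch; single samples give a thousand noisy ones; mini-batches land in between.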
19.7 TensorDataset
A generic dataset wrapping any tensors that share the same first dimension:
```python
class TensorDataset(Dataset):
    """Dataset wrapping tensors."""

    def __init__(self, *tensors):
        # All tensors must have the same first dimension
        assert all(t.shape[0] == tensors[0].shape[0] for t in tensors)
        self.tensors = tensors

    def __len__(self):
        return self.tensors[0].shape[0]

    def __getitem__(self, idx):
        return tuple(t.data[idx] for t in self.tensors)
```

Usage:

```python
X = Tensor(np.random.randn(100, 4))
y = Tensor(np.random.randint(0, 3, 100))

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=16, shuffle=True)
```
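A quick check on the objects just created: indexing returns plain NumPy values (via `.data`), which the `DataLoader` stacks back into tensors.

```python
print(len(dataset))           # 100 samples
features, label = dataset[0]
print(features.shape, label)  # (4,) and an integer class label
print(len(loader))            # 7 batches: ceil(100 / 16)
```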
19.8 Complete Training Script

```python
from tensorweaver import Tensor
from tensorweaver.nn import Module, Linear, Sequential, ReLU, Dropout
from tensorweaver.optim import Adam
from tensorweaver.data import TensorDataset, DataLoader
# cross_entropy is the loss function built in an earlier chapter

# Prepare data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2
)

# Normalize with statistics computed on the training set only
mean, std = X_train.mean(0), X_train.std(0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

# Create datasets and loaders
train_dataset = TensorDataset(Tensor(X_train), Tensor(y_train))
test_dataset = TensorDataset(Tensor(X_test), Tensor(y_test))
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Model
model = Sequential(
    Linear(4, 16),
    ReLU(),
    Dropout(0.2),
    Linear(16, 3)
)
optimizer = Adam(model.parameters(), lr=0.01)

# Training
for epoch in range(50):
    model.train()
    total_loss = 0
    for batch_x, batch_y in train_loader:
        logits = model(batch_x)
        loss = cross_entropy(logits, batch_y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.data

    # Evaluate every 10 epochs
    if epoch % 10 == 0:
        model.eval()
        correct = 0
        total = 0
        for batch_x, batch_y in test_loader:
            logits = model(batch_x)
            preds = logits.data.argmax(axis=-1)
            correct += (preds == batch_y.data).sum()
            total += len(batch_y.data)
        acc = correct / total
        print(f"Epoch {epoch}: loss={total_loss:.4f}, test_acc={acc:.2%}")
```

19.9 Part V Complete!
Tip
Milestone: You’ve built a professional training framework!
- ✓ Module base class with auto-registration
- ✓ Container modules (Sequential, ModuleList, ModuleDict)
- ✓ state_dict for save/load
- ✓ Dataset and DataLoader
Your code now looks like PyTorch!
19.10 Summary
- Dataset: Abstract data access
- DataLoader: Batching + shuffling + iteration
- TensorDataset: Wraps tensors into a dataset
- Batching is essential for:
- Memory efficiency
- Regularization (noisy gradients)
- Faster training
Next: Exporting models to ONNX for production.