19  Dataset and DataLoader

Clean data handling for training. Essential for larger datasets.

19.1 The Problem

Our current approach:

# Load all data into tensors
X_train = Tensor(all_data)
y_train = Tensor(all_labels)

# Train on entire dataset each epoch
for epoch in range(epochs):
    logits = model(X_train)  # Entire dataset at once!
    loss = loss_fn(logits, y_train)

Problems:

  • Large datasets don’t fit in memory (see the rough estimate below)
  • No shuffling: the model sees the same order every epoch
  • No control over batch size
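To make “doesn’t fit in memory” concrete, here is a rough back-of-the-envelope estimate using ImageNet-scale numbers (purely an illustration, not a dataset used in this chapter):

# Roughly 1.28M images at 224x224x3, stored as float32 (4 bytes per value)
n_images = 1_280_000
bytes_per_image = 224 * 224 * 3 * 4
print(f"{n_images * bytes_per_image / 1e9:.0f} GB")  # about 770 GB, far beyond typical RAM

Loading samples in small batches sidesteps this: only the current batch needs to be in memory at once.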

19.2 The Dataset Class

Abstract interface for data access:

class Dataset:
    """Abstract base class for datasets."""

    def __len__(self):
        """Return number of samples."""
        raise NotImplementedError

    def __getitem__(self, idx):
        """Return sample at index."""
        raise NotImplementedError

19.3 Implementing a Dataset

class IrisDataset(Dataset):
    """Iris flower dataset."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Usage
from sklearn.datasets import load_iris
iris = load_iris()
dataset = IrisDataset(iris.data, iris.target)

print(f"Dataset size: {len(dataset)}")
x, y = dataset[0]
print(f"Sample: features={x}, label={y}")

19.4 The DataLoader Class

Handles batching, shuffling, and iteration:

import numpy as np

class DataLoader:
    """Iterates over a dataset in batches."""

    def __init__(self, dataset, batch_size=32, shuffle=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __len__(self):
        """Number of batches."""
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        """Iterate over batches."""
        n = len(self.dataset)
        indices = np.arange(n)

        if self.shuffle:
            np.random.shuffle(indices)

        for start in range(0, n, self.batch_size):
            end = min(start + self.batch_size, n)
            batch_indices = indices[start:end]

            # Collect batch
            batch_x = []
            batch_y = []
            for idx in batch_indices:
                x, y = self.dataset[idx]
                batch_x.append(x)
                batch_y.append(y)

            yield Tensor(np.array(batch_x)), Tensor(np.array(batch_y))
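A quick sanity check of the ceiling-division batch count: Iris has 150 samples, so a batch size of 16 should give 10 batches, with the last batch holding only the remaining 6 samples.

dataset = IrisDataset(iris.data, iris.target)   # 150 samples
loader = DataLoader(dataset, batch_size=16)

print(len(loader))                         # (150 + 16 - 1) // 16 = 10
print([len(bx.data) for bx, _ in loader])  # [16, 16, 16, 16, 16, 16, 16, 16, 16, 6]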

19.5 Using DataLoader

# Create dataset and loader
dataset = IrisDataset(X_train, y_train)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Training loop with batches
for epoch in range(epochs):
    for batch_x, batch_y in loader:
        logits = model(batch_x)
        loss = cross_entropy(logits, batch_y)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
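One detail worth noting: shuffling happens inside __iter__, so every pass over the loader, and therefore every epoch, draws a fresh permutation rather than reusing the first one. A small check, reusing the dataset from above:

loader = DataLoader(dataset, batch_size=16, shuffle=True)

labels_epoch1 = [int(y) for _, by in loader for y in by.data]
labels_epoch2 = [int(y) for _, by in loader for y in by.data]

print(labels_epoch1 == labels_epoch2)  # almost certainly False: a new order each pass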

19.6 Why Batching Matters

Approach        Memory    Gradient quality    Speed
Full batch      High      Best                Slow per update
Mini-batch      Medium    Good                Fast
Single sample   Low       Noisy               Slow overall

Mini-batch (16-128) is the sweet spot.

19.7 TensorDataset

Generic dataset from tensors:

class TensorDataset(Dataset):
    """Dataset wrapping tensors."""

    def __init__(self, *tensors):
        # All tensors must have same first dimension
        assert all(t.shape[0] == tensors[0].shape[0] for t in tensors)
        self.tensors = tensors

    def __len__(self):
        return self.tensors[0].shape[0]

    def __getitem__(self, idx):
        return tuple(t.data[idx] for t in self.tensors)

Usage:

X = Tensor(np.random.randn(100, 4))
y = Tensor(np.random.randint(0, 3, 100))

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=16, shuffle=True)
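Because TensorDataset accepts any number of tensors, the same class also covers cases like per-sample weights. A small sketch; note that the DataLoader above unpacks exactly two values per sample, so a third tensor is only usable through direct indexing unless the loader is generalized:

# A third tensor with the same first dimension, e.g. per-sample weights
w = Tensor(np.ones(100))

dataset = TensorDataset(X, y, w)
x0, y0, w0 = dataset[0]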

19.8 Complete Training Script

from tensorweaver import Tensor
from tensorweaver.nn import Module, Linear, Sequential, ReLU, Dropout
from tensorweaver.optim import Adam
from tensorweaver.data import TensorDataset, DataLoader
# cross_entropy is used below; it is assumed available as in earlier chapters

# Prepare data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2
)

# Normalize
mean, std = X_train.mean(0), X_train.std(0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

# Create datasets and loaders
train_dataset = TensorDataset(Tensor(X_train), Tensor(y_train))
test_dataset = TensorDataset(Tensor(X_test), Tensor(y_test))

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Model
model = Sequential(
    Linear(4, 16),
    ReLU(),
    Dropout(0.2),
    Linear(16, 3)
)

optimizer = Adam(model.parameters(), lr=0.01)

# Training
for epoch in range(50):
    model.train()
    total_loss = 0

    for batch_x, batch_y in train_loader:
        logits = model(batch_x)
        loss = cross_entropy(logits, batch_y)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_loss += loss.data

    # Evaluate
    if epoch % 10 == 0:
        model.eval()
        correct = 0
        total = 0

        for batch_x, batch_y in test_loader:
            logits = model(batch_x)
            preds = logits.data.argmax(axis=-1)
            correct += (preds == batch_y.data).sum()
            total += len(batch_y.data)

        acc = correct / total
        print(f"Epoch {epoch}: loss={total_loss:.4f}, test_acc={acc:.2%}")

19.9 Part V Complete!

Tip

Milestone: You’ve built a professional training framework!

  • ✓ Module base class with auto-registration
  • ✓ Container modules (Sequential, ModuleList, ModuleDict)
  • ✓ state_dict for save/load
  • ✓ Dataset and DataLoader

Your code now looks like PyTorch!

19.10 Summary

  • Dataset: Abstract data access
  • DataLoader: Batching + shuffling + iteration
  • TensorDataset: Wraps tensors into a dataset
  • Batching is essential for:
    • Memory efficiency
    • Regularization (noisy gradients)
    • Faster training

Next: Exporting models to ONNX for production.