23  Introduction to GPU Computing

CPUs are fast. GPUs are massively parallel. Let’s use them.

23.1 Why GPUs?

                    CPU                 GPU
  Cores             8-64                1000s
  Clock speed       High (3-5 GHz)      Lower (1-2 GHz)
  Best for          Sequential tasks    Parallel tasks
  Matrix multiply   Good                Excellent

Deep learning is mostly matrix operations — perfect for GPUs.
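To see why, note that a fully connected layer’s forward pass is just a matrix multiply plus a bias add. A minimal sketch in plain NumPy (the layer sizes are illustrative):

```python
import numpy as np

# A batch of 32 inputs, each with 784 features (e.g., flattened 28x28 images)
x = np.random.randn(32, 784)

# Weights and bias of a dense layer with 128 output units
W = np.random.randn(784, 128)
b = np.zeros(128)

# Forward pass: one matrix multiply plus a broadcast bias add
y = x @ W + b
print(y.shape)  # (32, 128)
```

Stack a few of these layers and almost all of the arithmetic is matmuls, which is exactly the workload GPUs excel at.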

23.2 The Speedup

Typical speedups for neural networks:

  Operation        CPU → GPU speedup
  MatMul (large)   10-100x
  Convolution      20-50x
  Batch training   10-50x
  Transformer      20-100x

23.3 GPU Libraries

  Library     Description
  CUDA        NVIDIA’s low-level API
  cuDNN       NVIDIA’s deep learning primitives
  CuPy        NumPy-like API for CUDA
  cuNumeric   Drop-in NumPy replacement

23.4 Why cuNumeric?

cuNumeric is special — it’s a drop-in replacement for NumPy:

# Original code
import numpy as np
a = np.random.randn(1000, 1000)
b = np.random.randn(1000, 1000)
c = a @ b  # Runs on CPU

# With cuNumeric — same code, runs on GPU!
import cunumeric as np
a = np.random.randn(1000, 1000)
b = np.random.randn(1000, 1000)
c = a @ b  # Runs on GPU!

No code changes needed!
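Because the API matches, a common pattern is to try cuNumeric and fall back to plain NumPy when it isn’t installed. A sketch — the `np` alias keeps the rest of the code identical either way:

```python
try:
    import cunumeric as np  # GPU-accelerated if available
except ImportError:
    import numpy as np      # plain CPU fallback

# The same code runs under either module
a = np.random.randn(1000, 1000)
b = np.random.randn(1000, 1000)
c = a @ b
print(c.shape)  # (1000, 1000)
```

This keeps a single codebase that runs everywhere and simply goes faster on machines with a GPU.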

23.5 Installing cuNumeric

cuNumeric requires an NVIDIA GPU and a working CUDA installation:

# Via conda (recommended)
conda install -c nvidia -c conda-forge -c legate cunumeric

# Or via pip
pip install cunumeric

Check installation:

import cunumeric as np
print(np.__version__)

# Test GPU
a = np.ones((1000, 1000))
b = np.ones((1000, 1000))
c = a @ b
print(f"Result sum: {c.sum()}")  # Each entry is 1000, so the sum is 1000^3 = 1e9

23.6 CPU vs GPU Comparison

import time
import numpy as np_cpu
import cunumeric as np_gpu

def benchmark_matmul(np_module, size=2000, runs=10):
    a = np_module.random.randn(size, size).astype(np_module.float32)
    b = np_module.random.randn(size, size).astype(np_module.float32)

    # Warmup: the first operation pays one-time startup costs
    _ = float((a @ b).sum())

    # Benchmark
    start = time.perf_counter()
    for _ in range(runs):
        c = a @ b
    # cuNumeric launches work asynchronously, so block on a scalar
    # result before stopping the clock; otherwise we time only launches
    _ = float(c.sum())
    elapsed = time.perf_counter() - start

    return elapsed / runs

cpu_time = benchmark_matmul(np_cpu)
gpu_time = benchmark_matmul(np_gpu)

print(f"CPU: {cpu_time*1000:.2f} ms")
print(f"GPU: {gpu_time*1000:.2f} ms")
print(f"Speedup: {cpu_time/gpu_time:.1f}x")

Typical output:

CPU: 423.15 ms
GPU: 12.34 ms
Speedup: 34.3x

23.7 How cuNumeric Works

cuNumeric uses Legate for distributed computing:

flowchart TD
    Code[Python Code] --> cuNumeric[cuNumeric API]
    cuNumeric --> Legate[Legate Runtime]
    Legate --> CPU[CPU Tasks]
    Legate --> GPU[GPU Tasks]
    Legate --> Multi[Multi-GPU/Node]

Benefits:

  • Automatic parallelization
  • Multi-GPU support
  • Multi-node (cluster) support
  • Same NumPy API
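Scaling beyond one GPU is a launcher-side concern rather than a code change: the script stays as-is and resources are requested when launching it through the Legate driver. A hedged sketch — the exact flag names vary between Legate versions, so check `legate --help`:

```shell
# Run an unmodified cuNumeric script on 2 GPUs via the Legate driver
# (resource flags are illustrative; consult your Legate version’s docs)
legate --gpus 2 my_script.py
```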

23.8 Limitations

cuNumeric isn’t perfect:

  1. Setup complexity: Requires CUDA, conda environment
  2. Startup overhead: First operation is slow
  3. Small arrays: CPU may be faster for small data
  4. Not all NumPy functions: Some missing

Rule of thumb: use the GPU when arrays are large enough that compute dominates launch and transfer overhead (think millions of elements); for small arrays the CPU is often faster.
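One way to act on this rule is a small dispatch helper that picks the backend by problem size. This is purely illustrative — the helper name and the threshold are assumptions, and it falls back to NumPy when cuNumeric isn’t installed:

```python
import numpy

try:
    import cunumeric
    HAS_GPU_BACKEND = True
except ImportError:
    HAS_GPU_BACKEND = False

# Illustrative threshold: below this, launch/transfer overhead
# tends to outweigh the GPU's parallelism
GPU_THRESHOLD = 1_000_000

def pick_backend(n_elements):
    """Return the array module to use for this problem size."""
    if HAS_GPU_BACKEND and n_elements >= GPU_THRESHOLD:
        return cunumeric
    return numpy

xp = pick_backend(512 * 512)  # small-ish problem: plain NumPy
a = xp.ones((512, 512))
print(xp.__name__)
```

In practice the right threshold depends on the hardware and the operation, so it is worth measuring with a benchmark like the one in 23.6 rather than guessing.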

23.9 Summary

  • GPUs provide massive parallelism
  • Deep learning is ideal for GPUs
  • cuNumeric = drop-in NumPy replacement
  • Same code runs on CPU or GPU
  • 10-100x speedups for large operations

Next: Designing a backend abstraction for TensorWeaver.