```mermaid
flowchart TD
    Code[Python Code] --> cuNumeric[cuNumeric API]
    cuNumeric --> Legate[Legate Runtime]
    Legate --> CPU[CPU Tasks]
    Legate --> GPU[GPU Tasks]
    Legate --> Multi[Multi-GPU/Node]
```
# 23 Introduction to GPU Computing

CPUs are fast. GPUs are massively parallel. Let’s use them.

## 23.1 Why GPUs?
| | CPU | GPU |
|---|---|---|
| Cores | 8-64 | 1000s |
| Clock speed | High (3-5 GHz) | Lower (1-2 GHz) |
| Best for | Sequential tasks | Parallel tasks |
| Matrix multiply | Good | Excellent |
Deep learning is mostly matrix operations — perfect for GPUs.
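To see why, consider that the forward pass of a dense neural-network layer is essentially one matrix multiply. A quick NumPy sketch (the batch and layer sizes here are illustrative, not from any particular model):

```python
import numpy as np

# A dense layer computes y = x @ W + b.
# Batch of 64 inputs, mapping 512 features to 256 units.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 512))   # inputs
W = rng.standard_normal((512, 256))  # weights
b = np.zeros(256)                    # biases

y = x @ W + b   # one matmul dominates the cost
print(y.shape)  # (64, 256)
```

Every one of the 64 × 256 outputs is an independent dot product, which is exactly the kind of work a GPU's thousands of cores can do simultaneously.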
## 23.2 The Speedup
Typical speedups for neural networks:
| Operation | CPU → GPU Speedup |
|---|---|
| MatMul (large) | 10-100x |
| Convolution | 20-50x |
| Batch training | 10-50x |
| Transformer | 20-100x |
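A back-of-the-envelope way to see where these speedups come from: multiplying two n × n matrices costs about 2n³ floating-point operations, all of them independent across output elements. A small helper (a sketch, not part of any library):

```python
def matmul_flops(n: int) -> int:
    """FLOPs for an n x n @ n x n matrix multiply.

    There are n^2 outputs, each a dot product of length n
    (n multiplies plus roughly n adds, so ~2n flops each).
    """
    return 2 * n ** 3

# A 4096 x 4096 multiply is ~137 GFLOPs of embarrassingly
# parallel arithmetic -- plenty to saturate thousands of cores.
print(matmul_flops(4096) / 1e9)
```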
## 23.3 GPU Libraries
| Library | Description |
|---|---|
| CUDA | NVIDIA’s low-level API |
| cuDNN | NVIDIA’s deep learning primitives |
| CuPy | NumPy-like API for CUDA |
| cuNumeric | Drop-in NumPy replacement |
## 23.4 Why cuNumeric?
cuNumeric is special — it’s a drop-in replacement for NumPy:
```python
# Original code
import numpy as np

a = np.random.randn(1000, 1000)
b = np.random.randn(1000, 1000)
c = a @ b  # Runs on CPU
```

```python
# With cuNumeric -- same code, runs on GPU!
import cunumeric as np

a = np.random.randn(1000, 1000)
b = np.random.randn(1000, 1000)
c = a @ b  # Runs on GPU!
```

No code changes needed!
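Because the two APIs match, a common pattern is to parameterize functions over the array module itself, so one code path serves both backends. A minimal sketch, shown with plain NumPy (passing `cunumeric` instead would be the only change):

```python
import numpy

def normalize(xp, x):
    """Zero-mean, unit-variance normalization using whichever
    array module (numpy or cunumeric) is passed as xp."""
    return (x - xp.mean(x)) / xp.std(x)

x = numpy.array([1.0, 2.0, 3.0, 4.0])
print(normalize(numpy, x))
```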
## 23.5 Installing cuNumeric

cuNumeric requires an NVIDIA GPU and CUDA:

```bash
# Via conda (recommended)
conda install -c nvidia -c conda-forge -c legate cunumeric

# Or via pip
pip install cunumeric
```

Check the installation:
```python
import cunumeric as np

print(np.__version__)

# Test GPU
a = np.ones((1000, 1000))
b = np.ones((1000, 1000))
c = a @ b
print(f"Result sum: {c.sum()}")  # Each entry is 1000, so the sum is 1000000000.0
```

## 23.6 CPU vs GPU Comparison
```python
import time
import numpy as np_cpu
import cunumeric as np_gpu

def benchmark_matmul(np_module, size=2000, runs=10):
    a = np_module.random.randn(size, size).astype(np_module.float32)
    b = np_module.random.randn(size, size).astype(np_module.float32)

    # Warmup
    _ = a @ b

    # Benchmark
    start = time.perf_counter()
    for _ in range(runs):
        c = a @ b
    # Force completion: cuNumeric executes asynchronously, so without
    # materializing a result the GPU timing would mostly measure launches.
    float(c.sum())
    elapsed = time.perf_counter() - start
    return elapsed / runs

cpu_time = benchmark_matmul(np_cpu)
gpu_time = benchmark_matmul(np_gpu)

print(f"CPU: {cpu_time*1000:.2f} ms")
print(f"GPU: {gpu_time*1000:.2f} ms")
print(f"Speedup: {cpu_time/gpu_time:.1f}x")
```

Typical output:

```
CPU: 423.15 ms
GPU: 12.34 ms
Speedup: 34.3x
```
## 23.7 How cuNumeric Works

cuNumeric uses Legate for distributed computing (see the flowchart above): Python code calls the cuNumeric API, and the Legate runtime dispatches the work to CPU tasks, GPU tasks, or multiple GPUs and nodes.

Benefits:

- Automatic parallelization
- Multi-GPU support
- Multi-node (cluster) support
- Same NumPy API
## 23.8 Limitations

cuNumeric isn’t perfect:
- Setup complexity: Requires CUDA, conda environment
- Startup overhead: First operation is slow
- Small arrays: CPU may be faster for small data
- Not all NumPy functions: Some missing
Rule of thumb: reach for the GPU when arrays are large (on the order of a million elements or more); below that, launch and transfer overhead often erases the gain.
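One way to act on that rule of thumb is a small dispatch helper that picks the backend by problem size. A sketch, written so it degrades gracefully when cuNumeric is not installed (the threshold value here is illustrative, not a measured crossover point):

```python
import numpy

try:
    import cunumeric  # only usable with an NVIDIA GPU + CUDA setup
except ImportError:
    cunumeric = None

def pick_backend(n_elements: int, threshold: int = 1_000_000):
    """Return cunumeric for large problems when available, else numpy."""
    if cunumeric is not None and n_elements >= threshold:
        return cunumeric
    return numpy

xp = pick_backend(100)  # small problem: stays on the CPU
print(xp.__name__)
```

In practice you would calibrate `threshold` by benchmarking your own workload, since the crossover depends on the operation and the hardware.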
## 23.9 Summary
- GPUs provide massive parallelism
- Deep learning is ideal for GPUs
- cuNumeric = drop-in NumPy replacement
- Same code runs on CPU or GPU
- 10-100x speedups for large operations
Next: Designing a backend abstraction for TensorWeaver.