32 Design Decisions
TensorWeaver makes deliberate trade-offs. This appendix explains why.
32.1 Philosophy: Transparency Over Performance
Decision: Prioritize debuggability over speed.
Why:
- Understanding beats optimization for learning
- NumPy is readable; CUDA kernels are not
- Bugs should be traceable, not hidden in C++ layers
Trade-off: TensorWeaver is 100-1000x slower than PyTorch. That’s intentional.
32.2 Architecture Choices
32.2.1 Pure Python Implementation
Decision: No C/C++/CUDA extensions.
Why:
- Step through any operation with pdb
- Read and modify any code path
- No compilation step
Alternative: Use cuNumeric for GPU acceleration without writing CUDA.
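For example, the "step through any operation" point is literal: a plain breakpoint() drops into pdb, and stepping from there walks straight into readable NumPy code. A hypothetical session, using the Tensor API from 32.3.1:

```python
x = Tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
breakpoint()   # opens pdb; type `step` to walk into y.backward() frame by frame
y.backward()   # every frame on the way down is plain Python + NumPy
```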
32.2.2 Eager Execution Only
Decision: No graph compilation (unlike TensorFlow 1.x or JAX’s jit).
Why:
- Every operation executes immediately
- Print statements work anywhere
- Errors point to the exact line
Trade-off: Can’t optimize across operations.
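A small illustration of what eager execution buys (hypothetical snippet, using the Tensor API shown in 32.3 and 32.4):

```python
x = Tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
print(y.data)  # values exist immediately; no session or compile step needed
z = y + 1      # if this line raised, the traceback would point exactly here
```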
32.2.3 PyTorch-Compatible API
Decision: Match PyTorch’s interface where possible.
Why:
- Familiar to most practitioners
- Easy migration path
- Extensive documentation available
Deviations: We simplify where PyTorch overcomplicates.
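As an illustration of the migration path, a training step written against the shared API surface looks the same under either framework. The sketch below only uses names that appear elsewhere in this appendix (model(x), backward(), step(), zero_grad()); model, optimizer, and loss_fn are assumed to be passed in:

```python
def train_step(model, optimizer, loss_fn, x, target):
    pred = model(x)               # Module.__call__ -> forward()
    loss = loss_fn(pred, target)  # any scalar loss
    loss.backward()               # reverse-mode autodiff
    optimizer.step()              # apply gradients
    optimizer.zero_grad()         # clear for the next step
    return loss
```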
32.3 Tensor Design
32.3.1 Shape Stored as Tuple
```python
import numpy as np

class Tensor:
    def __init__(self, data, requires_grad=False):
        self.data = np.asarray(data)
        self.requires_grad = requires_grad
        self.shape = self.data.shape  # Tuple, not a property
```

Why: Direct access is clearer than property indirection.
32.3.2 Gradient Accumulation
Decision: Gradients accumulate until explicitly zeroed.
```python
# Gradients add up
loss1.backward()
loss2.backward()  # Gradients from both losses

# Must manually zero
optimizer.zero_grad()
```

Why:
- Matches PyTorch behavior
- Required for gradient accumulation techniques
- Explicit is better than implicit
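This is what makes the standard gradient-accumulation pattern possible. A sketch (micro_batches, loss_fn, and accum_steps are illustrative names):

```python
accum_steps = 4  # simulate a 4x larger batch
optimizer.zero_grad()
for i, (x, target) in enumerate(micro_batches):
    loss = loss_fn(model(x), target) / accum_steps  # scale so the sum is an average
    loss.backward()                                 # gradients keep accumulating
    if (i + 1) % accum_steps == 0:
        optimizer.step()       # apply the accumulated gradient
        optimizer.zero_grad()  # start the next accumulation window
```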
32.3.3 No In-Place Operations
Decision: Most operations create new tensors.
```python
# Creates a new tensor
y = x + 1

# Not: x += 1 (would break gradient tracking)
```

Why: In-place operations complicate gradient computation: operators often save their inputs for the backward pass, and mutating those inputs in place would corrupt the saved values.
32.4 Autodiff Design
32.4.1 Dynamic Computational Graph
Decision: Build graph during forward pass, discard after backward.
```python
x = Tensor([1, 2, 3], requires_grad=True)
y = x * 2     # Creates graph node
z = y + 1     # Extends graph
z.backward()  # Uses and discards graph
```

Why:
- Different inputs can take different code paths
- Natural for control flow (if/while)
- Easier to debug
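Because the graph is rebuilt on every call, ordinary Python control flow decides what gets differentiated. A hypothetical sketch (use_residual is just an illustrative flag):

```python
def forward(x, use_residual):
    y = x * 2
    if use_residual:     # a plain Python branch, re-evaluated on every call
        y = y + x
    for _ in range(3):   # loops simply extend the graph
        y = y + 1
    return y

z = forward(Tensor([1.0, 2.0], requires_grad=True), use_residual=True)
z.backward()  # differentiates exactly the path that actually ran
```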
32.4.2 Reverse-Mode Differentiation
Decision: Compute gradients from output to input.
Why:
- Efficient for many inputs, few outputs
- Neural networks have millions of parameters, one loss
- Forward mode would require one pass per parameter
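A back-of-the-envelope comparison makes the asymmetry concrete (illustrative numbers, not a benchmark):

```python
n_params = 1_000_000            # a modestly sized model
forward_mode_passes = n_params  # one directional-derivative pass per parameter
reverse_mode_passes = 1         # one backward pass yields every gradient at once
print(f"forward mode needs {forward_mode_passes // reverse_mode_passes:,}x more passes")
```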
32.4.3 Operator-Level Backward
Decision: Each operator defines its own backward pass.
```python
class MultiplyOp(Operator):
    def forward(self, a, b):
        self.save_for_backward(a, b)
        return a * b

    def backward(self, grad_output):
        a, b = self.saved_tensors
        return grad_output * b, grad_output * a
```

Why:
- Modular and extensible
- Easy to verify correctness
- Each backward is independent
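Independence also means each backward rule can be checked on its own against a finite-difference estimate. A sketch in plain NumPy that verifies the multiply rule above (it checks the math only, outside the Operator machinery):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d.flat[i] = eps
        g.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
analytic = np.ones_like(a) * b                        # backward rule: d(sum(a*b))/da = b
numeric = numeric_grad(lambda a_: np.sum(a_ * b), a)  # finite-difference estimate
assert np.allclose(analytic, numeric, atol=1e-4)
```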
32.5 Module System
32.5.1 Parameter Registration
Decision: Automatic parameter discovery via __setattr__.
```python
class Linear(Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = Tensor(...)  # Auto-registered
        self.bias = Tensor(...)    # Auto-registered
```

Why: Less boilerplate than manual registration.
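A minimal sketch of how __setattr__-based discovery can work. This illustrates the idea rather than TensorWeaver's exact code, and assumes parameters are Tensor objects with requires_grad=True (see 32.3.1):

```python
class Module:
    def __setattr__(self, name, value):
        # Intercept every attribute assignment and file it in the right registry.
        if isinstance(value, Tensor) and getattr(value, 'requires_grad', False):
            self.__dict__.setdefault('_parameters', {})[name] = value
        elif isinstance(value, Module):
            self.__dict__.setdefault('_modules', {})[name] = value
        object.__setattr__(self, name, value)  # still set the attribute normally
```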
32.5.2 Nested Module Support
Decision: Modules can contain other modules.
```python
class MLP(Module):
    def __init__(self):
        super().__init__()         # initialize the Module base class first
        self.fc1 = Linear(10, 20)  # Child module
        self.fc2 = Linear(20, 5)   # Child module
```

Why: Compositional design enables complex architectures.
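Nesting pays off when collecting parameters, because traversal can simply recurse through the child registries. A sketch continuing the illustrative _parameters/_modules registries from 32.5.1:

```python
class Module:
    # ...registration as sketched in 32.5.1...

    def parameters(self):
        yield from self.__dict__.get('_parameters', {}).values()  # own parameters first
        for child in self.__dict__.get('_modules', {}).values():  # then children, recursively
            yield from child.parameters()
```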
32.5.3 Forward Method Convention
Decision: __call__ invokes forward().
```python
class Module:
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)
```

Why:
- Clean API: model(x) instead of model.forward(x)
- Hooks can be added around forward()
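The single dispatch point is also where hooks would slot in. A sketch of that idea (the _forward_hooks list is illustrative, not necessarily TensorWeaver's mechanism):

```python
class Module:
    def __call__(self, *args, **kwargs):
        output = self.forward(*args, **kwargs)
        for hook in getattr(self, '_forward_hooks', []):  # e.g. activation logging
            hook(self, args, output)
        return output
```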
32.6 Optimizer Design
32.6.1 State Per Parameter
Decision: Optimizers maintain state indexed by parameter object.
```python
class Adam(Optimizer):
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr
        self.state = {}  # {param: {'m': ..., 'v': ...}}

    def step(self):
        for p in self.params:
            state = self.state.setdefault(p, {'m': 0, 'v': 0})
            # ...update p using state['m'] and state['v']...
```

Why: Parameters can have different state (momentum, etc.).
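For concreteness, a sketch of the per-parameter Adam update that this state feeds, using the standard Adam formulas; the .grad and .data attribute names are assumptions in PyTorch style, consistent with the rest of this appendix:

```python
import numpy as np

def adam_update(p, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    g = p.grad                                                    # gradient from backward()
    state['t'] = state.get('t', 0) + 1
    state['m'] = beta1 * state.get('m', 0) + (1 - beta1) * g      # 1st-moment estimate
    state['v'] = beta2 * state.get('v', 0) + (1 - beta2) * g * g  # 2nd-moment estimate
    m_hat = state['m'] / (1 - beta1 ** state['t'])                # bias correction
    v_hat = state['v'] / (1 - beta2 ** state['t'])
    p.data -= lr * m_hat / (np.sqrt(v_hat) + eps)
```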
32.6.2 Explicit Step
Decision: Gradients computed and applied separately.
```python
loss.backward()        # Compute gradients
optimizer.step()       # Apply gradients
optimizer.zero_grad()  # Clear gradients
```

Why: Flexibility for gradient clipping, accumulation, debugging.
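The gap between backward() and step() is exactly where extra logic such as global-norm gradient clipping fits. A sketch assuming each parameter exposes a NumPy-valued .grad:

```python
import numpy as np

def clip_grad_norm(params, max_norm):
    params = [p for p in params if p.grad is not None]
    total = np.sqrt(sum(float(np.sum(p.grad ** 2)) for p in params))  # global L2 norm
    if total > max_norm:
        for p in params:
            p.grad *= max_norm / (total + 1e-6)  # scale all gradients down uniformly

loss.backward()
clip_grad_norm(model.parameters(), max_norm=1.0)  # slots in between the two calls
optimizer.step()
optimizer.zero_grad()
```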
32.7 ONNX Export Design
32.7.1 Trace-Based Export
Decision: Record operations during forward pass.
```python
def export_onnx(model, sample_input, path):
    # Run forward with tracing enabled
    output = model(sample_input)
    # Convert recorded ops to ONNX graph
```

Why:
- Simple implementation
- Captures actual computation
- Works with dynamic shapes (with caveats)
32.7.2 Standard Op Mapping
Decision: Map TensorWeaver ops to standard ONNX ops.
```python
OP_MAPPING = {
    'MatMul': 'MatMul',
    'Add': 'Add',
    'Relu': 'Relu',
    # ...
}
```

Why: Maximum compatibility with ONNX Runtime and other tools.
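Downstream, a recorded trace plus OP_MAPPING is enough to assemble a graph with the official onnx helpers. A sketch assuming a trace of (op_name, input_names, output_name) tuples, which is an illustrative format rather than TensorWeaver's actual one:

```python
from onnx import helper, TensorProto

def trace_to_onnx(trace, input_infos, output_name):
    # One ONNX node per recorded operation, renamed via OP_MAPPING.
    nodes = [helper.make_node(OP_MAPPING[op], list(inputs), [output])
             for op, inputs, output in trace]
    graph = helper.make_graph(
        nodes, 'tensorweaver_model',
        inputs=[helper.make_tensor_value_info(name, TensorProto.FLOAT, shape)
                for name, shape in input_infos],
        outputs=[helper.make_tensor_value_info(output_name, TensorProto.FLOAT, None)],
    )
    return helper.make_model(graph)
```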
32.8 What We Don’t Do
32.8.1 No Distributed Training
Reason: Adds complexity without teaching core concepts.
32.8.2 No Custom CUDA Kernels
Reason: Focus on understanding, not performance.
32.8.3 No JIT Compilation
Reason: Eager execution is more debuggable.
32.8.4 No Automatic Mixed Precision
Reason: Adds complexity; use cuNumeric for GPU.
32.9 Summary
TensorWeaver’s design prioritizes:
- Readability: Code should be understandable
- Debuggability: Errors should be traceable
- Simplicity: Fewer features, done well
- Familiarity: PyTorch-compatible API
These choices make TensorWeaver ideal for learning, at the cost of production performance.