32 Design Decisions
TensorWeaver makes deliberate trade-offs. This appendix explains why.
32.1 Philosophy: Transparency Over Performance
Decision: Prioritize debuggability over speed.
Why:
- Understanding beats optimization for learning
- NumPy is readable; CUDA kernels are not
- Bugs should be traceable, not hidden in C++ layers
Trade-off: TensorWeaver is 100-1000x slower than PyTorch. That’s intentional.
32.2 Architecture Choices
32.2.1 Pure Python Implementation
Decision: No C/C++/CUDA extensions.
Why:
- Step through any operation with pdb
- Read and modify any code path
- No compilation step
Alternative: Use cuNumeric for GPU acceleration without writing CUDA.
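For example, the "step through any operation" point is literal: a plain breakpoint() drops into pdb, and stepping from there walks straight into readable NumPy code. A hypothetical session, using the Tensor API from 32.3.1:

```python
x = Tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
breakpoint()   # opens pdb; type `step` to walk into y.backward() frame by frame
y.backward()   # every frame on the way down is plain Python + NumPy
```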
32.2.2 Eager Execution Only
Decision: No graph compilation (unlike TensorFlow 1.x or JAX’s jit).
Why:
- Every operation executes immediately
- Print statements work anywhere
- Errors point to the exact line
Trade-off: Can’t optimize across operations.
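A small illustration of what eager execution buys (hypothetical snippet, using the Tensor API shown in 32.3 and 32.4):

```python
x = Tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
print(y.data)  # values exist immediately; no session or compile step needed
z = y + 1      # if this line raised, the traceback would point exactly here
```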
32.2.3 PyTorch-Compatible API
Decision: Match PyTorch’s interface where possible.
Why:
- Familiar to most practitioners
- Easy migration path
- Extensive documentation available
Deviations: We simplify where PyTorch overcomplicates.
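As an illustration of the migration path, a training step written against the shared API surface looks the same under either framework. The sketch below only uses names that appear elsewhere in this appendix (model(x), backward(), step(), zero_grad()); model, optimizer, and loss_fn are assumed to be passed in:

```python
def train_step(model, optimizer, loss_fn, x, target):
    pred = model(x)               # Module.__call__ -> forward()
    loss = loss_fn(pred, target)  # any scalar loss
    loss.backward()               # reverse-mode autodiff
    optimizer.step()              # apply gradients
    optimizer.zero_grad()         # clear for the next step
    return loss
```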
32.3 Tensor Design
32.3.1 Shape Stored as Tuple
```python
import numpy as np

class Tensor:
    def __init__(self, data, requires_grad=False):
        self.data = np.asarray(data)
        self.requires_grad = requires_grad
        self.shape = self.data.shape  # Tuple, not a property
```

Why: Direct access is clearer than property indirection.
32.3.2 Gradient Accumulation
Decision: Gradients accumulate until explicitly zeroed.
```python
# Gradients add up
loss1.backward()
loss2.backward()  # Gradients from both losses

# Must manually zero
optimizer.zero_grad()
```

Why:
- Matches PyTorch behavior
- Required for gradient accumulation techniques
- Explicit is better than implicit
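This is what makes the standard gradient-accumulation pattern possible. A sketch (micro_batches, loss_fn, and accum_steps are illustrative names):

```python
accum_steps = 4  # simulate a 4x larger batch
optimizer.zero_grad()
for i, (x, target) in enumerate(micro_batches):
    loss = loss_fn(model(x), target) / accum_steps  # scale so the sum is an average
    loss.backward()                                 # gradients keep accumulating
    if (i + 1) % accum_steps == 0:
        optimizer.step()       # apply the accumulated gradient
        optimizer.zero_grad()  # start the next accumulation window
```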
32.3.3 No In-Place Operations
Decision: Most operations create new tensors.
```python
# Creates a new tensor
y = x + 1

# Not: x += 1 (would break gradient tracking)
```

Why: In-place operations complicate gradient computation: operators often save their inputs for the backward pass, and mutating those inputs in place would corrupt the saved values.
32.4 Autodiff Design
32.4.1 Dynamic Computational Graph
Decision: Build graph during forward pass, discard after backward.
```python
x = Tensor([1, 2, 3], requires_grad=True)
y = x * 2     # Creates graph node
z = y + 1     # Extends graph
z.backward()  # Uses and discards graph
```

Why:
- Different inputs can take different code paths
- Natural for control flow (if/while)
- Easier to debug
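Because the graph is rebuilt on every call, ordinary Python control flow decides what gets differentiated. A hypothetical sketch (use_residual is just an illustrative flag):

```python
def forward(x, use_residual):
    y = x * 2
    if use_residual:     # a plain Python branch, re-evaluated on every call
        y = y + x
    for _ in range(3):   # loops simply extend the graph
        y = y + 1
    return y

z = forward(Tensor([1.0, 2.0], requires_grad=True), use_residual=True)
z.backward()  # differentiates exactly the path that actually ran
```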
32.4.2 Reverse-Mode Differentiation
Decision: Compute gradients from output to input.
Why:
- Efficient for many inputs, few outputs
- Neural networks have millions of parameters, one loss
- Forward mode would require one pass per parameter
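A back-of-the-envelope comparison makes the asymmetry concrete (illustrative numbers, not a benchmark):

```python
n_params = 1_000_000            # a modestly sized model
forward_mode_passes = n_params  # one directional-derivative pass per parameter
reverse_mode_passes = 1         # one backward pass yields every gradient at once
print(f"forward mode needs {forward_mode_passes // reverse_mode_passes:,}x more passes")
```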
32.4.3 Operator-Level Backward
Decision: Each operator defines its own backward pass.
```python
class MultiplyOp(Operator):
    def forward(self, a, b):
        self.save_for_backward(a, b)
        return a * b

    def backward(self, grad_output):
        a, b = self.saved_tensors
        return grad_output * b, grad_output * a
```

Why:
- Modular and extensible
- Easy to verify correctness
- Each backward is independent
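Independence also means each backward rule can be checked on its own against a finite-difference estimate. A sketch in plain NumPy that verifies the multiply rule above (it checks the math only, outside the Operator machinery):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d.flat[i] = eps
        g.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
analytic = np.ones_like(a) * b                        # backward rule: d(sum(a*b))/da = b
numeric = numeric_grad(lambda a_: np.sum(a_ * b), a)  # finite-difference estimate
assert np.allclose(analytic, numeric, atol=1e-4)
```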
32.5 Module System
32.5.1 Parameter Registration
Decision: Automatic parameter discovery via __setattr__.
```python
class Linear(Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = Tensor(...)  # Auto-registered
        self.bias = Tensor(...)    # Auto-registered
```

Why: Less boilerplate than manual registration.
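A minimal sketch of how __setattr__-based discovery can work. This illustrates the idea rather than TensorWeaver's exact code, and assumes parameters are Tensor objects with requires_grad=True (see 32.3.1):

```python
class Module:
    def __setattr__(self, name, value):
        # Intercept every attribute assignment and file it in the right registry.
        if isinstance(value, Tensor) and getattr(value, 'requires_grad', False):
            self.__dict__.setdefault('_parameters', {})[name] = value
        elif isinstance(value, Module):
            self.__dict__.setdefault('_modules', {})[name] = value
        object.__setattr__(self, name, value)  # still set the attribute normally
```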
32.5.2 Nested Module Support
Decision: Modules can contain other modules.
```python
class MLP(Module):
    def __init__(self):
        super().__init__()         # initialize the Module base class first
        self.fc1 = Linear(10, 20)  # Child module
        self.fc2 = Linear(20, 5)   # Child module
```

Why: Compositional design enables complex architectures.
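Nesting pays off when collecting parameters, because traversal can simply recurse through the child registries. A sketch continuing the illustrative _parameters/_modules registries from 32.5.1:

```python
class Module:
    # ...registration as sketched in 32.5.1...

    def parameters(self):
        yield from self.__dict__.get('_parameters', {}).values()  # own parameters first
        for child in self.__dict__.get('_modules', {}).values():  # then children, recursively
            yield from child.parameters()
```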
32.5.3 Forward Method Convention
Decision: __call__ invokes forward().
```python
class Module:
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)
```

Why:
- Clean API: model(x) instead of model.forward(x)
- Hooks can be added around forward()
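The single dispatch point is also where hooks would slot in. A sketch of that idea (the _forward_hooks list is illustrative, not necessarily TensorWeaver's mechanism):

```python
class Module:
    def __call__(self, *args, **kwargs):
        output = self.forward(*args, **kwargs)
        for hook in getattr(self, '_forward_hooks', []):  # e.g. activation logging
            hook(self, args, output)
        return output
```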
32.6 Optimizer Design
32.6.1 State Per Parameter
Decision: Optimizers maintain state indexed by parameter object.
```python
class Adam(Optimizer):
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr
        self.state = {}  # {param: {'m': ..., 'v': ...}}

    def step(self):
        for p in self.params:
            state = self.state.setdefault(p, {'m': 0, 'v': 0})
            # ...update p using state['m'] and state['v']...
```

Why: Parameters can have different state (momentum, etc.).
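For concreteness, a sketch of the per-parameter Adam update that this state feeds, using the standard Adam formulas; the .grad and .data attribute names are assumptions in PyTorch style, consistent with the rest of this appendix:

```python
import numpy as np

def adam_update(p, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    g = p.grad                                                    # gradient from backward()
    state['t'] = state.get('t', 0) + 1
    state['m'] = beta1 * state.get('m', 0) + (1 - beta1) * g      # 1st-moment estimate
    state['v'] = beta2 * state.get('v', 0) + (1 - beta2) * g * g  # 2nd-moment estimate
    m_hat = state['m'] / (1 - beta1 ** state['t'])                # bias correction
    v_hat = state['v'] / (1 - beta2 ** state['t'])
    p.data -= lr * m_hat / (np.sqrt(v_hat) + eps)
```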
32.6.2 Explicit Step
Decision: Gradients computed and applied separately.
```python
loss.backward()        # Compute gradients
optimizer.step()       # Apply gradients
optimizer.zero_grad()  # Clear gradients
```

Why: Flexibility for gradient clipping, accumulation, debugging.
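The gap between backward() and step() is exactly where extra logic such as global-norm gradient clipping fits. A sketch assuming each parameter exposes a NumPy-valued .grad:

```python
import numpy as np

def clip_grad_norm(params, max_norm):
    params = [p for p in params if p.grad is not None]
    total = np.sqrt(sum(float(np.sum(p.grad ** 2)) for p in params))  # global L2 norm
    if total > max_norm:
        for p in params:
            p.grad *= max_norm / (total + 1e-6)  # scale all gradients down uniformly

loss.backward()
clip_grad_norm(model.parameters(), max_norm=1.0)  # slots in between the two calls
optimizer.step()
optimizer.zero_grad()
```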
32.7 ONNX Export Design
32.7.1 Trace-Based Export
Decision: Record operations during forward pass.
```python
def export_onnx(model, sample_input, path):
    # Run forward with tracing enabled
    output = model(sample_input)
    # Convert recorded ops to ONNX graph
```

Why:
- Simple implementation
- Captures actual computation
- Works with dynamic shapes (with caveats)
32.7.2 Standard Op Mapping
Decision: Map TensorWeaver ops to standard ONNX ops.
```python
OP_MAPPING = {
    'MatMul': 'MatMul',
    'Add': 'Add',
    'Relu': 'Relu',
    # ...
}
```

Why: Maximum compatibility with ONNX Runtime and other tools.
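Downstream, a recorded trace plus OP_MAPPING is enough to assemble a graph with the official onnx helpers. A sketch assuming a trace of (op_name, input_names, output_name) tuples, which is an illustrative format rather than TensorWeaver's actual one:

```python
from onnx import helper, TensorProto

def trace_to_onnx(trace, input_infos, output_name):
    # One ONNX node per recorded operation, renamed via OP_MAPPING.
    nodes = [helper.make_node(OP_MAPPING[op], list(inputs), [output])
             for op, inputs, output in trace]
    graph = helper.make_graph(
        nodes, 'tensorweaver_model',
        inputs=[helper.make_tensor_value_info(name, TensorProto.FLOAT, shape)
                for name, shape in input_infos],
        outputs=[helper.make_tensor_value_info(output_name, TensorProto.FLOAT, None)],
    )
    return helper.make_model(graph)
```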
32.8 What We Don’t Do
32.8.1 No Distributed Training
Reason: Adds complexity without teaching core concepts.
32.8.2 No Custom CUDA Kernels
Reason: Focus on understanding, not performance.
32.8.3 No JIT Compilation
Reason: Eager execution is more debuggable.
32.8.4 No Automatic Mixed Precision
Reason: Adds complexity; use cuNumeric for GPU.
32.9 Summary
TensorWeaver’s design prioritizes:
- Readability: Code should be understandable
- Debuggability: Errors should be traceable
- Simplicity: Fewer features, done well
- Familiarity: PyTorch-compatible API
These choices make TensorWeaver ideal for learning, at the cost of production performance.