20  Introduction to ONNX

Training is done. Now let’s deploy to production.

20.1 What is ONNX?

ONNX (Open Neural Network Exchange) is an open standard format for representing neural networks: a model exported from one framework can be run by any runtime that understands the format.

flowchart LR
    TW[TensorWeaver] --> ONNX[ONNX Model]
    PT[PyTorch] --> ONNX
    TF[TensorFlow] --> ONNX
    ONNX --> ORT[ONNX Runtime]
    ONNX --> Mobile[Mobile]
    ONNX --> Web[Web/WASM]
    ONNX --> Edge[Edge Devices]

20.2 Why ONNX?

Problem                  ONNX Solution
Framework lock-in        Portable format
Slow inference           Optimized runtimes
Deployment complexity    Universal format
Hardware variety         Provider abstraction

20.3 ONNX Model Structure

An ONNX model is a computational graph:

ONNX Model
├── Graph
│   ├── Nodes (operations)
│   ├── Inputs (model inputs)
│   ├── Outputs (model outputs)
│   └── Initializers (weights)
└── Metadata (version, producer, etc.)

20.4 A Simple Example

Our temperature model in ONNX:

Inputs: celsius [batch, 1]
Initializers: weight [1, 1] = 1.8
              bias [1] = 32.0
Nodes:
  1. MatMul(celsius, weight) -> temp1
  2. Add(temp1, bias) -> fahrenheit
Outputs: fahrenheit [batch, 1]
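
The same graph can be built by hand with the onnx Python helpers. Here is a short sketch showing where the inputs, initializers, and nodes from the listing above end up; the producer_name string is arbitrary.

import onnx
from onnx import helper, TensorProto

# Model input/output: dynamic batch dimension, one feature
celsius = helper.make_tensor_value_info("celsius", TensorProto.FLOAT, ["batch", 1])
fahrenheit = helper.make_tensor_value_info("fahrenheit", TensorProto.FLOAT, ["batch", 1])

# Initializers: the trained parameters, stored inside the graph
weight = helper.make_tensor("weight", TensorProto.FLOAT, [1, 1], [1.8])
bias = helper.make_tensor("bias", TensorProto.FLOAT, [1], [32.0])

# Nodes: fahrenheit = celsius @ weight + bias
matmul = helper.make_node("MatMul", ["celsius", "weight"], ["temp1"])
add = helper.make_node("Add", ["temp1", "bias"], ["fahrenheit"])

graph = helper.make_graph(
    [matmul, add], "temperature",
    inputs=[celsius], outputs=[fahrenheit], initializer=[weight, bias])
model = helper.make_model(graph, producer_name="tensorweaver")

onnx.checker.check_model(model)
onnx.save(model, "temperature.onnx")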

20.5 ONNX Operators

ONNX defines standard operators:

Category          Operators
Math              Add, Sub, Mul, Div, MatMul
Activations       Relu, Sigmoid, Tanh, Softmax
Normalization     BatchNormalization, LayerNormalization
Shape             Reshape, Transpose, Concat
Reduction         ReduceMean, ReduceSum

Full list: onnx.ai/onnx/operators/
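
The installed onnx package also carries every operator schema, so you can query them programmatically. A small example (the exact operator count depends on your onnx version):

from onnx import defs

# Look up one operator's schema
schema = defs.get_schema("MatMul")
print(schema.name, "introduced in opset", schema.since_version)

# How many operator schemas does this onnx install know about?
print(len(defs.get_all_schemas()), "operators registered")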

20.6 ONNX Runtime

The official inference engine:

import numpy as np
import onnxruntime as ort

# Load model
session = ort.InferenceSession("model.onnx")

# Get input/output names
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Prepare input matching the model's dtype and shape: [batch, 1] float32
input_data = np.array([[100.0]], dtype=np.float32)

# Run inference: list of outputs to fetch, dict of input name -> array
result = session.run(
    [output_name],
    {input_name: input_data}
)

ONNX Runtime is highly optimized:

  • Graph optimization
  • Operator fusion
  • Hardware acceleration (CPU, GPU, NPU)
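
These optimizations are configured through SessionOptions. A minimal example that explicitly requests the most aggressive optimization level:

import onnxruntime as ort

# Request the full set of graph optimizations (constant folding,
# node fusion, layout transforms) before the session is created
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession("model.onnx", sess_options=opts)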

20.7 Execution Providers

ONNX Runtime supports multiple backends:

# CPU (default)
session = ort.InferenceSession("model.onnx",
    providers=['CPUExecutionProvider'])

# NVIDIA GPU
session = ort.InferenceSession("model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

# Intel (OpenVINO)
session = ort.InferenceSession("model.onnx",
    providers=['OpenVINOExecutionProvider', 'CPUExecutionProvider'])

# Apple Silicon (Core ML)
session = ort.InferenceSession("model.onnx",
    providers=['CoreMLExecutionProvider', 'CPUExecutionProvider'])
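
Providers are tried in the order listed, and which providers exist depends on how your onnxruntime package was built. Two quick checks:

import onnxruntime as ort

# Providers compiled into this onnxruntime build, in default priority order
print(ort.get_available_providers())
# e.g. ['CPUExecutionProvider'] for the plain pip package

# Providers a session actually ended up using after fallback
session = ort.InferenceSession("model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
print(session.get_providers())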

20.8 Our Export Pipeline

flowchart LR
    TW[TensorWeaver Model] --> Trace[Trace Forward]
    Trace --> Build[Build ONNX Graph]
    Build --> Save[Save .onnx File]
    Save --> ORT[ONNX Runtime]

  1. Trace: Execute forward pass, record operations
  2. Build: Convert to ONNX nodes
  3. Save: Write to file
  4. Run: Use ONNX Runtime
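
To make step 1 concrete, here is a toy sketch of the tracing idea: wrap values so every operation records itself into a list as the forward pass runs. The class and method names here are placeholders for illustration; TensorWeaver's real tracer is built in the next chapter.

class TracedTensor:
    """Toy stand-in for a tensor that records the ops applied to it."""

    def __init__(self, name, ops):
        self.name = name
        self.ops = ops                      # shared op list for the whole trace

    def _record(self, op_type, other_name):
        out = TracedTensor(f"{op_type.lower()}_out", self.ops)
        self.ops.append((op_type, [self.name, other_name], out.name))
        return out

    def matmul(self, other_name):
        return self._record("MatMul", other_name)

    def add(self, other_name):
        return self._record("Add", other_name)

# "Run" the temperature model's forward pass and inspect the trace
ops = []
celsius = TracedTensor("celsius", ops)
fahrenheit = celsius.matmul("weight").add("bias")
print(ops)
# [('MatMul', ['celsius', 'weight'], 'matmul_out'),
#  ('Add', ['matmul_out', 'bias'], 'add_out')]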

20.9 Installing Dependencies

pip install onnx onnxruntime
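
A quick sanity check that both packages installed correctly (your version numbers will differ):

import onnx
import onnxruntime

# Print installed versions; any reasonably recent release of both works here
print("onnx", onnx.__version__)
print("onnxruntime", onnxruntime.__version__)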

20.10 Verifying ONNX Models

import onnx

# Load and check model
model = onnx.load("model.onnx")
onnx.checker.check_model(model)  # Raises if invalid

# Print model info
print(f"IR version: {model.ir_version}")
print(f"Opset: {model.opset_import[0].version}")
print(f"Inputs: {[i.name for i in model.graph.input]}")
print(f"Outputs: {[o.name for o in model.graph.output]}")

20.11 Summary

  • ONNX = portable neural network format
  • ONNX Runtime = fast inference engine
  • Export once, run everywhere
  • Supports CPU, GPU, mobile, edge

Next: Building the ONNX exporter.