12  Activation Functions

Linear layers alone can only learn linear functions. Activations add non-linearity.

12.1 The Problem with Linearity

Stack two linear layers:

\[y = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2\]

This is just another linear function! No matter how many linear layers you stack, you can only learn linear relationships.
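
A quick NumPy-only check (not TensorWeaver code) makes the collapse concrete:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2        # stack two linear layers
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # one layer with W = W2 W1, b = W2 b1 + b2
print(np.allclose(two_layers, one_layer))   # True: same function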

The Iris dataset has non-linear decision boundaries. We need non-linearity.

12.2 ReLU: The Workhorse

Rectified Linear Unit — the most popular activation:

\[\text{ReLU}(x) = \max(0, x)\]

def relu(x):
    """ReLU activation: max(0, x)"""
    result_data = np.maximum(0, x.data)
    result = Tensor(result_data, requires_grad=x.requires_grad)

    if x.requires_grad:
        result.grad_fn = 'relu'
        result.parents = [x]
        result._relu_mask = (x.data > 0)  # Save for backward

    return result

Backward: \[\frac{\partial}{\partial x}\text{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}\]

# In backward computation
if grad_fn == 'relu':
    grad = grad_output * result._relu_mask
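
A quick NumPy-only sanity check of the mask logic (a sketch, not TensorWeaver code):

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
out = np.maximum(0, x)              # forward:  [0, 0, 0, 0.5, 2]
mask = (x > 0).astype(float)        # backward mask: [0, 0, 0, 1, 1]
grad_output = np.ones_like(x)       # pretend the upstream gradient is all ones
print(out, grad_output * mask)      # gradient flows only where x > 0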
Note: see src/tensorweaver/nn/functional/__init__.py for the full activation implementations.

12.3 Why ReLU Works

Activation | Gradient Range | Problem
Sigmoid    | (0, 0.25]      | Vanishing gradients
Tanh       | (0, 1]         | Vanishing gradients
ReLU       | {0, 1}         | None (for positive inputs)

ReLU gradients don’t shrink when multiplied:

sigmoid: 0.25 × 0.25 × 0.25 ≈ 0.016 (vanished!)
ReLU: 1 × 1 × 1 = 1 (preserved!)
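
The same effect over more layers, using sigmoid's best-case gradient of 0.25 (a tiny illustration):

depth = 10
sigmoid_chain = 0.25 ** depth    # ≈ 9.5e-07, even in sigmoid's best case
relu_chain = 1.0 ** depth        # 1.0 for every active ReLU unit
print(f"{sigmoid_chain:.1e} vs {relu_chain:.1f}")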

12.4 Other Activations

12.4.1 Sigmoid

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

def sigmoid(x):
    output = 1 / (1 + np.exp(-x.data))
    result = Tensor(output, requires_grad=x.requires_grad)
    if x.requires_grad:
        result.grad_fn = 'sigmoid'
        result.parents = [x]
        result._sigmoid_output = output
    return result

# Backward: σ'(x) = σ(x)(1 - σ(x))
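
Following the same dispatch pattern as the ReLU backward above, a sketch of the matching branch (reusing the saved output):

# In backward computation
if grad_fn == 'sigmoid':
    s = result._sigmoid_output
    grad = grad_output * s * (1 - s)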

Use case: Binary classification output (probability 0-1)

12.4.2 Tanh

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]

def tanh(x):
    output = np.tanh(x.data)
    result = Tensor(output, requires_grad=x.requires_grad)
    if x.requires_grad:
        result.grad_fn = 'tanh'
        result.parents = [x]
        result._tanh_output = output
    return result

# Backward: tanh'(x) = 1 - tanh(x)²
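
And, in the same style, a sketch of the matching backward branch:

# In backward computation
if grad_fn == 'tanh':
    grad = grad_output * (1 - result._tanh_output ** 2)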

Use case: When output needs to be in (-1, 1)

12.4.3 GELU

Gaussian Error Linear Unit — used in Transformers:

\[\text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702x)\]

def gelu(x):
    # Tanh approximation of GELU (the exact form uses the Gaussian CDF Φ)
    output = 0.5 * x.data * (1 + np.tanh(
        np.sqrt(2 / np.pi) * (x.data + 0.044715 * x.data ** 3)
    ))
    result = Tensor(output, requires_grad=x.requires_grad)
    if x.requires_grad:
        result.grad_fn = 'gelu'
        result.parents = [x]
    return result

Use case: Transformer models (BERT, GPT)
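
To see how GELU differs from ReLU around zero, a plain-NumPy sketch (gelu_approx is just a local helper for illustration, not the TensorWeaver API):

import numpy as np

def gelu_approx(x):
    # Same tanh approximation as above, applied to a raw NumPy array
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

xs = np.array([-3.0, -1.0, -0.75, -0.1, 0.0, 0.1, 1.0, 3.0])
print(np.round(gelu_approx(xs), 3))
# ReLU is exactly 0 for every negative input; GELU dips slightly below zero
# (minimum ≈ -0.17 near x ≈ -0.75) and is smooth through 0.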

12.4.4 Softmax

Converts logits to probabilities:

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

def softmax(x, axis=-1):
    # Numerical stability: subtract max
    x_shifted = x.data - x.data.max(axis=axis, keepdims=True)
    exp_x = np.exp(x_shifted)
    output = exp_x / exp_x.sum(axis=axis, keepdims=True)
    return Tensor(output, requires_grad=x.requires_grad)

Use case: Multi-class classification output
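
A quick check that the max-shift leaves the probabilities unchanged while keeping exp from overflowing (a plain-NumPy sketch, not the TensorWeaver API):

import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])      # naive np.exp(logits) would overflow
shifted = logits - logits.max()                  # shifting by a constant leaves softmax unchanged
probs = np.exp(shifted) / np.exp(shifted).sum()
print(probs, probs.sum())                        # ≈ [0.090, 0.245, 0.665], sums to 1.0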

12.5 Activation Comparison

Activation | Range             | Use Case
ReLU       | [0, ∞)            | Hidden layers (default)
Sigmoid    | (0, 1)            | Binary output
Tanh       | (-1, 1)           | Hidden layers (alternative)
GELU       | (≈ -0.17, ∞)      | Transformers
Softmax    | (0, 1), sums to 1 | Multi-class output

12.6 Simple Network with ReLU

import numpy as np

from tensorweaver import Tensor
from tensorweaver.nn.functional import relu
from tensorweaver.optim import Adam

# XOR problem (non-linear!)
X = Tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
y = Tensor([[0], [1], [1], [0]])

# Two-layer network
W1 = Tensor(np.random.randn(2, 4) * 0.5, requires_grad=True)
b1 = Tensor(np.zeros(4), requires_grad=True)
W2 = Tensor(np.random.randn(4, 1) * 0.5, requires_grad=True)
b2 = Tensor(np.zeros(1), requires_grad=True)

optimizer = Adam([W1, b1, W2, b2], lr=0.1)

for epoch in range(1000):
    # Forward with ReLU
    h = relu(X @ W1 + b1)  # Hidden layer
    out = h @ W2 + b2       # Output

    loss = ((out - y) ** 2).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if epoch % 200 == 0:
        print(f"Epoch {epoch}: loss={loss.data:.4f}")

# Test
print(f"\nPredictions: {(out.data > 0.5).astype(int).flatten()}")
print(f"Targets:     {y.data.flatten().astype(int)}")

Without ReLU, this network cannot learn XOR!
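
To verify that the non-linearity is doing the work, rerun the same loop with the activation removed (a one-line sketch of the change):

h = X @ W1 + b1   # no ReLU: the two layers collapse into a single linear map

With that change, the best the model can do on XOR is predict 0.5 for every input, so the MSE plateaus around 0.25 instead of approaching 0.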

12.7 Choosing Activations

Hidden layers:

  • Start with ReLU (simple, effective)
  • Try GELU for Transformers

Output layer:

  • Regression: none (linear output)
  • Binary classification: Sigmoid
  • Multi-class classification: Softmax

12.8 Summary

  • Activations add non-linearity to networks
  • ReLU is the default for hidden layers
  • Sigmoid for binary, Softmax for multi-class output
  • GELU for Transformer architectures

Next: Regularization to prevent overfitting.