12  Activation Functions

Linear layers alone can only learn linear functions. Activations add non-linearity.

12.1 The Problem with Linearity

Stack two linear layers:

\[y = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2\]

This is just another linear function! No matter how many linear layers you stack, you can only learn linear relationships.
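
A quick NumPy-only check (not TensorWeaver code) makes the collapse concrete:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2        # stack two linear layers
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # one layer with W = W2 W1, b = W2 b1 + b2
print(np.allclose(two_layers, one_layer))   # True: same function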

The Iris dataset has non-linear decision boundaries. We need non-linearity.

12.2 ReLU: The Workhorse

Rectified Linear Unit — the most popular activation:

\[\text{ReLU}(x) = \max(0, x)\]

def relu(x):
    """ReLU activation: max(0, x)"""
    result_data = np.maximum(0, x.data)
    result = Tensor(result_data, requires_grad=x.requires_grad)

    if x.requires_grad:
        result.grad_fn = 'relu'
        result.parents = [x]
        result._relu_mask = (x.data > 0)  # Save for backward

    return result

Backward: \[\frac{\partial}{\partial x}\text{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}\]

# In backward computation
if grad_fn == 'relu':
    grad = grad_output * result._relu_mask
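
A quick NumPy-only sanity check of the mask logic (a sketch, not TensorWeaver code):

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
out = np.maximum(0, x)              # forward:  [0, 0, 0, 0.5, 2]
mask = (x > 0).astype(float)        # backward mask: [0, 0, 0, 1, 1]
grad_output = np.ones_like(x)       # pretend the upstream gradient is all ones
print(out, grad_output * mask)      # gradient flows only where x > 0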
Note: see src/tensorweaver/nn/functional/__init__.py for the full activation implementations.

12.3 Why ReLU Works

Activation | Gradient Range | Problem
Sigmoid    | (0, 0.25]      | Vanishing gradients
Tanh       | (0, 1]         | Vanishing gradients
ReLU       | {0, 1}         | None (for positive inputs)

ReLU gradients don’t shrink when multiplied:

sigmoid: 0.25 × 0.25 × 0.25 ≈ 0.016 (vanished!)
ReLU: 1 × 1 × 1 = 1 (preserved!)
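
The same effect over more layers, using sigmoid's best-case gradient of 0.25 (a tiny illustration):

depth = 10
sigmoid_chain = 0.25 ** depth    # ≈ 9.5e-07, even in sigmoid's best case
relu_chain = 1.0 ** depth        # 1.0 for every active ReLU unit
print(f"{sigmoid_chain:.1e} vs {relu_chain:.1f}")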

12.4 Other Activations

12.4.1 Sigmoid

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

def sigmoid(x):
    output = 1 / (1 + np.exp(-x.data))
    result = Tensor(output, requires_grad=x.requires_grad)
    if x.requires_grad:
        result.grad_fn = 'sigmoid'
        result.parents = [x]
        result._sigmoid_output = output
    return result

# Backward: σ'(x) = σ(x)(1 - σ(x))
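
Following the same dispatch pattern as the ReLU backward above, a sketch of the matching branch (reusing the saved output):

# In backward computation
if grad_fn == 'sigmoid':
    s = result._sigmoid_output
    grad = grad_output * s * (1 - s)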

Use case: Binary classification output (probability 0-1)

12.4.2 Tanh

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]

def tanh(x):
    output = np.tanh(x.data)
    result = Tensor(output, requires_grad=x.requires_grad)
    if x.requires_grad:
        result.grad_fn = 'tanh'
        result.parents = [x]
        result._tanh_output = output
    return result

# Backward: tanh'(x) = 1 - tanh(x)²
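
And, in the same style, a sketch of the matching backward branch:

# In backward computation
if grad_fn == 'tanh':
    grad = grad_output * (1 - result._tanh_output ** 2)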

Use case: When output needs to be in (-1, 1)

12.4.3 GELU

Gaussian Error Linear Unit — used in Transformers:

\[\text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702x)\]

def gelu(x):
    # Tanh approximation of GELU (the exact form uses the Gaussian CDF Φ)
    output = 0.5 * x.data * (1 + np.tanh(
        np.sqrt(2 / np.pi) * (x.data + 0.044715 * x.data ** 3)
    ))
    result = Tensor(output, requires_grad=x.requires_grad)
    if x.requires_grad:
        result.grad_fn = 'gelu'
        result.parents = [x]
    return result

Use case: Transformer models (BERT, GPT)
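
To see how GELU differs from ReLU around zero, a plain-NumPy sketch (gelu_approx is just a local helper for illustration, not the TensorWeaver API):

import numpy as np

def gelu_approx(x):
    # Same tanh approximation as above, applied to a raw NumPy array
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

xs = np.array([-3.0, -1.0, -0.75, -0.1, 0.0, 0.1, 1.0, 3.0])
print(np.round(gelu_approx(xs), 3))
# ReLU is exactly 0 for every negative input; GELU dips slightly below zero
# (minimum ≈ -0.17 near x ≈ -0.75) and is smooth through 0.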

12.4.4 Softmax

Converts logits to probabilities:

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

def softmax(x, axis=-1):
    # Numerical stability: subtract max
    x_shifted = x.data - x.data.max(axis=axis, keepdims=True)
    exp_x = np.exp(x_shifted)
    output = exp_x / exp_x.sum(axis=axis, keepdims=True)
    return Tensor(output, requires_grad=x.requires_grad)

Use case: Multi-class classification output
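
A quick check that the max-shift leaves the probabilities unchanged while keeping exp from overflowing (a plain-NumPy sketch, not the TensorWeaver API):

import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])      # naive np.exp(logits) would overflow
shifted = logits - logits.max()                  # shifting by a constant leaves softmax unchanged
probs = np.exp(shifted) / np.exp(shifted).sum()
print(probs, probs.sum())                        # ≈ [0.090, 0.245, 0.665], sums to 1.0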

12.5 Activation Comparison

Activation | Range             | Use Case
ReLU       | [0, ∞)            | Hidden layers (default)
Sigmoid    | (0, 1)            | Binary output
Tanh       | (-1, 1)           | Hidden layers (alternative)
GELU       | (≈ -0.17, ∞)      | Transformers
Softmax    | (0, 1), sums to 1 | Multi-class output

12.6 Simple Network with ReLU

import numpy as np

from tensorweaver import Tensor
from tensorweaver.nn.functional import relu
from tensorweaver.optim import Adam

# XOR problem (non-linear!)
X = Tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
y = Tensor([[0], [1], [1], [0]])

# Two-layer network
W1 = Tensor(np.random.randn(2, 4) * 0.5, requires_grad=True)
b1 = Tensor(np.zeros(4), requires_grad=True)
W2 = Tensor(np.random.randn(4, 1) * 0.5, requires_grad=True)
b2 = Tensor(np.zeros(1), requires_grad=True)

optimizer = Adam([W1, b1, W2, b2], lr=0.1)

for epoch in range(1000):
    # Forward with ReLU
    h = relu(X @ W1 + b1)  # Hidden layer
    out = h @ W2 + b2       # Output

    loss = ((out - y) ** 2).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if epoch % 200 == 0:
        print(f"Epoch {epoch}: loss={loss.data:.4f}")

# Test
print(f"\nPredictions: {(out.data > 0.5).astype(int).flatten()}")
print(f"Targets:     {y.data.flatten().astype(int)}")

Without ReLU, this network cannot learn XOR!
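
To verify that the non-linearity is doing the work, rerun the same loop with the activation removed (a one-line sketch of the change):

h = X @ W1 + b1   # no ReLU: the two layers collapse into a single linear map

With that change, the best the model can do on XOR is predict 0.5 for every input, so the MSE plateaus around 0.25 instead of approaching 0.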

12.7 Choosing Activations

Hidden layers:

  • Start with ReLU (simple, effective)
  • Try GELU for Transformers

Output layer:

  • Regression: none (linear output)
  • Binary classification: Sigmoid
  • Multi-class classification: Softmax

12.8 Summary

  • Activations add non-linearity to networks
  • ReLU is the default for hidden layers
  • Sigmoid for binary, Softmax for multi-class output
  • GELU for Transformer architectures

Next: Regularization to prevent overfitting.