12 Activation Functions
Linear layers alone can only learn linear functions. Activations add non-linearity.
12.1 The Problem with Linearity
Stack two linear layers:
\[y = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2\]
This is just another linear function! No matter how many linear layers you stack, you can only learn linear relationships.
The Iris dataset has non-linear decision boundaries. We need non-linearity.
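A quick NumPy check makes the collapse concrete (a standalone sketch, using the row-vector convention \(xW\) rather than the column-vector form above): composing the two layers' weights and biases reproduces the two-layer output exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers...
two_layers = (x @ W1 + b1) @ W2 + b2

# ...collapse into a single linear layer with W = W1 W2 and b = b1 W2 + b2
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True
```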
12.2 ReLU: The Workhorse
Rectified Linear Unit — the most popular activation:
\[\text{ReLU}(x) = \max(0, x)\]
```python
def relu(x):
    """ReLU activation: max(0, x)"""
    result_data = np.maximum(0, x.data)
    result = Tensor(result_data, requires_grad=x.requires_grad)
    if x.requires_grad:
        result.grad_fn = 'relu'
        result.parents = [x]
        result._relu_mask = (x.data > 0)  # Save for backward
    return result
```

Backward: \[\frac{\partial}{\partial x}\text{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}\]
```python
# In backward computation
if grad_fn == 'relu':
    grad = grad_output * result._relu_mask
```

Code Reference: See src/tensorweaver/nn/functional/__init__.py for activation implementations.
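As a sanity check, the mask rule can be compared against a finite-difference estimate of the derivative. This is a standalone NumPy sketch, independent of the Tensor class:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.3, 1.7])
eps = 1e-6

analytic = (x > 0).astype(float)  # the same rule as _relu_mask
numeric = (np.maximum(0, x + eps) - np.maximum(0, x - eps)) / (2 * eps)

print(np.allclose(analytic, numeric))  # True (away from x = 0, where ReLU is not differentiable)
```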
12.3 Why ReLU Works
| Activation | Gradient Range | Problem |
|---|---|---|
| Sigmoid | (0, 0.25] | Vanishing gradients |
| Tanh | (0, 1] | Vanishing gradients |
| ReLU | {0, 1} | None (for positive inputs) |
ReLU gradients don’t shrink when multiplied:
sigmoid: 0.25 × 0.25 × 0.25 = 0.016 (vanished!)
ReLU: 1 × 1 × 1 = 1 (preserved!)
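A small sketch over 10 layers shows the effect, taking the best case for sigmoid (its gradient peaks at 0.25 when x = 0) and an active ReLU unit (gradient 1):

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)  # never exceeds 0.25

depth = 10
print(sigmoid_grad(0.0) ** depth)  # ~9.5e-07 -- the gradient has vanished
print(1.0 ** depth)                # 1.0 -- an active ReLU unit preserves it
```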
12.4 Other Activations
12.4.1 Sigmoid
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]
```python
def sigmoid(x):
    output = 1 / (1 + np.exp(-x.data))
    result = Tensor(output, requires_grad=x.requires_grad)
    if x.requires_grad:
        result.grad_fn = 'sigmoid'
        result.parents = [x]
        result._sigmoid_output = output  # Save for backward
    return result

# Backward: σ'(x) = σ(x)(1 - σ(x))
```

Use case: Binary classification output (a probability in (0, 1))
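The matching branch of the backward computation could look like this (a sketch mirroring the 'relu' branch shown earlier, using the saved _sigmoid_output):

```python
# In backward computation (sketch)
if grad_fn == 'sigmoid':
    s = result._sigmoid_output
    grad = grad_output * s * (1 - s)  # σ'(x) = σ(x)(1 - σ(x))
```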
12.4.2 Tanh
\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]
```python
def tanh(x):
    output = np.tanh(x.data)
    result = Tensor(output, requires_grad=x.requires_grad)
    if x.requires_grad:
        result.grad_fn = 'tanh'
        result.parents = [x]
        result._tanh_output = output  # Save for backward
    return result

# Backward: tanh'(x) = 1 - tanh(x)²
```

Use case: When the output needs to be in (-1, 1)
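Again, a sketch of the corresponding backward branch, using the saved _tanh_output:

```python
# In backward computation (sketch)
if grad_fn == 'tanh':
    t = result._tanh_output
    grad = grad_output * (1 - t ** 2)  # tanh'(x) = 1 - tanh(x)²
```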
12.4.3 GELU
Gaussian Error Linear Unit — used in Transformers:
\[\text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702x)\]
where \(\Phi\) is the standard normal CDF. The implementation below uses the other common approximation, the tanh form:
\[\text{GELU}(x) \approx 0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\right)\]
```python
def gelu(x):
    # Tanh approximation of GELU
    output = 0.5 * x.data * (1 + np.tanh(
        np.sqrt(2 / np.pi) * (x.data + 0.044715 * x.data ** 3)
    ))
    result = Tensor(output, requires_grad=x.requires_grad)
    if x.requires_grad:
        result.grad_fn = 'gelu'
        result.parents = [x]
    return result
```

Use case: Transformer models (BERT, GPT)
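To see how close the tanh approximation is to the exact erf-based definition, here is a standalone NumPy sketch (not part of the library):

```python
import math
import numpy as np

x = np.linspace(-4, 4, 9)
exact = 0.5 * x * (1 + np.array([math.erf(v / math.sqrt(2)) for v in x]))
approx = 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

print(np.max(np.abs(exact - approx)))  # small: well under 0.01
```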
12.4.4 Softmax
Converts logits to probabilities:
\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]
```python
def softmax(x, axis=-1):
    # Numerical stability: subtract the max before exponentiating
    x_shifted = x.data - x.data.max(axis=axis, keepdims=True)
    exp_x = np.exp(x_shifted)
    output = exp_x / exp_x.sum(axis=axis, keepdims=True)
    return Tensor(output, requires_grad=x.requires_grad)
```

Use case: Multi-class classification output
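Why subtract the max? A quick NumPy sketch shows what goes wrong without the shift:

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax overflows: exp(1000) is inf, and inf/inf gives nan
naive = np.exp(logits) / np.exp(logits).sum()

# Shifting by the max is mathematically equivalent and numerically stable
shifted = np.exp(logits - logits.max())
stable = shifted / shifted.sum()

print(naive)   # [nan nan nan] (with an overflow warning)
print(stable)  # [0.09003057 0.24472847 0.66524096]
```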
12.5 Activation Comparison
| Activation | Range | Use Case |
|---|---|---|
| ReLU | [0, ∞) | Hidden layers (default) |
| Sigmoid | (0, 1) | Binary output |
| Tanh | (-1, 1) | Hidden layers (alternative) |
| GELU | (-0.17, ∞) | Transformers |
| Softmax | (0, 1), sum=1 | Multi-class output |
12.6 Simple Network with ReLU
```python
import numpy as np

from tensorweaver import Tensor
from tensorweaver.nn.functional import relu
from tensorweaver.optim import Adam

# XOR problem (non-linear!)
X = Tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
y = Tensor([[0], [1], [1], [0]])

# Two-layer network
W1 = Tensor(np.random.randn(2, 4) * 0.5, requires_grad=True)
b1 = Tensor(np.zeros(4), requires_grad=True)
W2 = Tensor(np.random.randn(4, 1) * 0.5, requires_grad=True)
b2 = Tensor(np.zeros(1), requires_grad=True)

optimizer = Adam([W1, b1, W2, b2], lr=0.1)

for epoch in range(1000):
    # Forward with ReLU
    h = relu(X @ W1 + b1)   # Hidden layer
    out = h @ W2 + b2       # Output
    loss = ((out - y) ** 2).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if epoch % 200 == 0:
        print(f"Epoch {epoch}: loss={loss.data:.4f}")

# Test
print(f"\nPredictions: {(out.data > 0.5).astype(int).flatten()}")
print(f"Targets:     {y.data.flatten().astype(int)}")
```

Without ReLU, this network cannot learn XOR!
12.7 Choosing Activations
Hidden layers:
- Start with ReLU (simple, effective)
- Try GELU for Transformers
Output layer:
- Regression: None (linear)
- Binary classification: Sigmoid
- Multi-class: Softmax
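Applied to a network's final linear output, those choices look like this (a sketch: logits stands in for the last layer's pre-activation output, and sigmoid/softmax are the functions defined earlier in this chapter):

```python
# Hypothetical pre-activation output of the final linear layer
logits = Tensor(np.array([[2.0, -1.0, 0.5]]))

reg_out = logits                     # regression: keep the raw linear output
bin_out = sigmoid(logits)            # binary classification: squash into (0, 1)
cls_out = softmax(logits, axis=-1)   # multi-class: probabilities that sum to 1

print(cls_out.data.sum())            # 1.0
```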
12.8 Summary
- Activations add non-linearity to networks
- ReLU is the default for hidden layers
- Sigmoid for binary, Softmax for multi-class output
- GELU for Transformer architectures
Next: Regularization to prevent overfitting.