Calculus for Optimization — Derivatives and Gradients
If linear algebra is the language of ML data, calculus is the engine that drives learning. Every time a neural network adjusts its weights, every time a logistic regression converges, and every time a loss function is minimised, the mechanism is rooted in one idea: the derivative. Understanding derivatives, gradients, the chain rule, and second-order quantities is what transforms the abstract notion of “the model learns” into a precise, computable process.
Why Calculus is the Engine of Machine Learning
Machine learning is, at its mathematical core, an optimisation problem. Given a model with learnable parameters θ and a dataset, we want to find the specific values of θ that minimise a loss function L(θ). The loss function measures how wrong the model’s predictions are. The question calculus answers is: in which direction should I move the parameters to make the loss smaller?
The derivative of the loss with respect to each parameter gives exactly that direction. The gradient — a vector of all partial derivatives — tells us the direction of steepest ascent in the loss landscape. Moving in the opposite direction (gradient descent) reduces the loss. This is the fundamental mechanism behind every ML optimiser from vanilla SGD to Adam.
Derivatives — The Slope at a Point
The derivative of a function f(x) at a point x measures the instantaneous rate at which f changes as x changes. Geometrically, it is the slope of the tangent line to the function's graph at that point. Formally, it is defined as the limit of the difference quotient as the step size h approaches zero: f′(x) = lim_{h→0} [f(x + h) − f(x)] / h.
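The limit definition can be checked numerically: a central difference with a small step h approximates the true slope. A minimal sketch (the function f, test point, and step size are illustrative choices):

```python
def f(x):
    return x**2 + 3*x

def f_prime(x):
    # analytic derivative: d/dx (x^2 + 3x) = 2x + 3
    return 2*x + 3

def numerical_derivative(f, x, h=1e-6):
    # central-difference approximation of the limit definition
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5
print(f_prime(x))                  # 6.0
print(numerical_derivative(f, x))  # ≈ 6.0
```

Autograd frameworks do not use finite differences; they apply exact differentiation rules. The numerical check is still the standard way to sanity-test a hand-derived gradient.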
In practice, we never compute this limit by hand for complex ML models. Instead, automatic differentiation (autograd) frameworks like PyTorch and TensorFlow compute exact derivatives analytically by applying differentiation rules to the computational graph of the function. However, understanding the differentiation rules is essential for reading research papers, deriving loss functions, and debugging training dynamics.
Why activation function derivatives matter: The vanishing gradient problem occurs when activation derivatives are near zero across many layers. ReLU avoids this because its derivative is exactly 1 for positive inputs and 0 for negative inputs — it either passes the gradient through unchanged or kills it. Sigmoid has a maximum derivative of 0.25, so stacking many sigmoid layers multiplies 0.25 repeatedly, driving gradients toward zero exponentially fast.
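The exponential decay is easy to see numerically. A short sketch (the layer counts are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# The sigmoid derivative sigma'(x) = sigma(x) * (1 - sigma(x)) peaks at x = 0
s = sigmoid(0.0)
max_deriv = s * (1 - s)
print(max_deriv)  # 0.25

# Backprop multiplies roughly one activation derivative per layer:
for n_layers in (5, 10, 20, 50):
    print(n_layers, max_deriv ** n_layers)
# At 20 layers the surviving gradient factor is already below 1e-12
```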
Derivatives of Key ML Functions
| Function | Formula | Derivative | ML Role |
|---|---|---|---|
| Sigmoid | σ(x) = 1/(1 + e⁻ˣ) | σ(x)(1 − σ(x)) | Binary classification output, logistic regression |
| Tanh | tanh(x) = (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) | 1 − tanh²(x) | LSTM gates, zero-centred alternative to sigmoid |
| ReLU | max(0, x) | 1 if x > 0, else 0 | Default hidden layer activation in deep networks |
| Leaky ReLU | max(αx, x), α small | 1 if x > 0, else α | Avoids dying ReLU by allowing small negative gradient |
| Softmax | sᵢ = e^(zᵢ) / ∑ⱼ e^(zⱼ) | ∂sᵢ/∂zⱼ = sᵢ(δᵢⱼ − sⱼ) | Multi-class output layer — gradient simplifies beautifully with cross-entropy |
| MSE Loss | (1/n) ∑ᵢ(yᵢ − ŷᵢ)² | ∂L/∂ŷᵢ = (2/n)(ŷᵢ − yᵢ) | Regression — gradient is proportional to the residual |
| Cross-Entropy | −∑ᵢ yᵢ log(ŷᵢ) | ŷ − y (after softmax) | Classification — extremely clean gradient, fast convergence |
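The closed forms in the table can be spot-checked against finite differences. A quick sketch (the test point x = 0.7 is arbitrary, chosen away from ReLU's kink at 0):

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    # central-difference approximation
    return (f(x + h) - f(x - h)) / (2 * h)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = 0.7  # arbitrary test point
assert abs(num_grad(sigmoid, x) - sigmoid(x) * (1 - sigmoid(x))) < 1e-6
assert abs(num_grad(np.tanh, x) - (1 - np.tanh(x)**2)) < 1e-6
assert abs(num_grad(relu, x) - 1.0) < 1e-6
print("table derivatives match finite differences")
```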
Partial Derivatives — Sensitivity to Each Input
Real ML models have not one but thousands or millions of parameters. A loss function that depends on all of them is a function of many variables. To optimise such a function, we need to know how the loss changes with respect to each parameter independently, while holding all others fixed. This is the partial derivative.
As a concrete example, consider the mean squared error loss for linear regression with intercept w₀ and slope w₁: L(w₀, w₁) = (1/n) ∑ᵢ (yᵢ − (w₀ + w₁xᵢ))². Holding w₁ fixed, ∂L/∂w₀ = (2/n) ∑ᵢ (w₀ + w₁xᵢ − yᵢ); holding w₀ fixed, ∂L/∂w₁ = (2/n) ∑ᵢ (w₀ + w₁xᵢ − yᵢ)xᵢ. Each partial derivative is an ordinary derivative taken with the other variable treated as a constant.
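The two partial derivatives can be computed in closed form and checked against finite differences taken with the other weight held fixed. A minimal sketch (the tiny dataset and weight values are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # illustrative inputs
y = np.array([2.0, 4.0, 6.0])   # illustrative targets

def mse(w0, w1):
    return np.mean((y - (w0 + w1 * x))**2)

w0, w1 = 0.5, 1.0
resid = (w0 + w1 * x) - y                    # prediction error per example
dL_dw0 = (2 / len(x)) * resid.sum()          # partial derivative w.r.t. w0
dL_dw1 = (2 / len(x)) * (resid * x).sum()    # partial derivative w.r.t. w1

# Check against finite differences, holding the other weight fixed
h = 1e-6
num_dw0 = (mse(w0 + h, w1) - mse(w0 - h, w1)) / (2 * h)
num_dw1 = (mse(w0, w1 + h) - mse(w0, w1 - h)) / (2 * h)
print(dL_dw0, num_dw0)   # both ≈ -3.0
print(dL_dw1, num_dw1)   # both ≈ -7.333
```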
The Chain Rule — The Heart of Backpropagation
The chain rule is the single most important result from calculus for deep learning. It tells us how to differentiate a composition of functions — and a neural network is precisely a deeply nested composition of functions. Every layer transforms its input with weights and an activation, and the chain rule lets us propagate the gradient of the final loss all the way back through every layer to every weight.
The chain rule states: if y = f(u) and u = g(x), then dy/dx = (dy/du) · (du/dx). The derivative of a composition is the product of the individual derivatives.

Consider a two-layer network written as a chain of compositions:

X → Z¹ = XW¹ → A¹ = ReLU(Z¹) → ŷ = σ(A¹W²) → L = BCE(y, ŷ)

Applying the chain rule from the loss back to the first layer's weights:

∂L/∂W¹ = (∂L/∂ŷ) · (∂ŷ/∂A¹) · (∂A¹/∂Z¹) · (∂Z¹/∂W¹)

Each ∂·/∂· factor is a local derivative, computable analytically at each node. Backpropagation multiplies them in reverse order, starting from the loss, accumulating the gradient all the way back to the first layer's weights.
A Concrete Backpropagation Example
Consider a minimal network: one linear layer followed by sigmoid, with binary cross-entropy loss. We compute the forward pass, then derive every gradient symbolically, then verify with PyTorch autograd.
```python
import numpy as np
import torch

# ═══ Manual backprop: single sample, single neuron ═══════════════
# Network: z = w*x + b, y_hat = sigmoid(z), L = -y*log(y_hat) - (1-y)*log(1-y_hat)

x, y, w, b = 2.0, 1.0, 0.5, -0.1

# ── Forward pass
z = w * x + b                                         # z = 0.9
y_hat = 1 / (1 + np.exp(-z))                          # y_hat = sigmoid(0.9) ≈ 0.7109
L = -(y * np.log(y_hat) + (1-y) * np.log(1-y_hat))    # BCE loss ≈ 0.3412

# ── Manual backward pass using chain rule
# dL/dy_hat = -y/y_hat + (1-y)/(1-y_hat)
dL_dyhat = -y / y_hat + (1-y) / (1-y_hat)

# dy_hat/dz = sigmoid(z) * (1 - sigmoid(z)) = y_hat * (1 - y_hat)
dyhat_dz = y_hat * (1 - y_hat)

# Chain rule: dL/dz = dL/dy_hat * dy_hat/dz = y_hat - y (beautiful simplification!)
dL_dz = dL_dyhat * dyhat_dz  # = y_hat - y

# dz/dw = x, dz/db = 1
dL_dw = dL_dz * x  # = (y_hat - y) * x
dL_db = dL_dz * 1  # = (y_hat - y)

print(f"Manual dL/dw = {dL_dw:.6f}, dL/db = {dL_db:.6f}")

# ═══ PyTorch autograd — verify our manual gradients ══════════════
w_t = torch.tensor(0.5, requires_grad=True)
b_t = torch.tensor(-0.1, requires_grad=True)
x_t = torch.tensor(2.0)
y_t = torch.tensor(1.0)

z_t = w_t * x_t + b_t
yhat_t = torch.sigmoid(z_t)
L_t = torch.nn.functional.binary_cross_entropy(yhat_t, y_t)
L_t.backward()  # compute all gradients automatically

print(f"Autograd dL/dw = {w_t.grad.item():.6f}, dL/db = {b_t.grad.item():.6f}")
# Both outputs are identical — our manual chain-rule derivation matches autograd exactly
```
The beautiful simplification: When sigmoid is combined with binary cross-entropy loss, the chain rule produces a strikingly clean gradient: ∂L/∂z = ŷ − y — simply the difference between the predicted probability and the true label. This is not a coincidence. The cross-entropy loss was specifically designed as the natural loss for the sigmoid, because this pairing produces the cleanest gradient and the fastest convergence. The same pattern holds for softmax combined with categorical cross-entropy.
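The same cancellation can be verified numerically for the softmax case: the gradient of categorical cross-entropy with respect to the logits should equal p − y. A sketch (the logits and label are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # illustrative logits
y = np.array([1.0, 0.0, 0.0])   # one-hot true label

p = softmax(z)
analytic = p - y                # claimed gradient dL/dz for L = -sum(y * log(p))

def loss(z):
    return -np.sum(y * np.log(softmax(z)))

# Finite-difference gradient, one logit at a time
h = 1e-6
numeric = np.array([(loss(z + h*e) - loss(z - h*e)) / (2*h) for e in np.eye(3)])
print(np.max(np.abs(analytic - numeric)))  # tiny: the gradients agree
```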
Gradient Descent — Navigating the Loss Landscape
With the gradient computed, the update rule is straightforward: move the parameters against the gradient by a small step scaled by the learning rate α, that is, θ ← θ − α∇L(θ). This is gradient descent, and it is the foundation of every ML optimisation algorithm.
Each step moves parameters against the gradient direction. Steps get smaller as the gradient flattens near the minimum. The learning rate α controls how far we move at each step.
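The update rule can be sketched on a one-dimensional quadratic whose minimum is known in advance (the target value 3 and the learning rate are illustrative):

```python
# Minimise L(theta) = (theta - 3)**2, whose gradient is 2 * (theta - 3)
def grad(theta):
    return 2 * (theta - 3)

theta, alpha = 0.0, 0.1
for step in range(50):
    theta = theta - alpha * grad(theta)   # theta <- theta - alpha * dL/dtheta

print(theta)  # close to 3.0, the true minimum
```

Each iteration multiplies the distance to the minimum by (1 − 2α), so the steps shrink automatically as the gradient flattens.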
The Three Variants of Gradient Descent
The vanilla gradient descent formulation uses the gradient computed over all training examples. In practice, three variants trade off gradient quality against computational cost: batch gradient descent (the full dataset per update, accurate but expensive), stochastic gradient descent (a single example per update, noisy but cheap), and mini-batch gradient descent (a small batch, typically 32 to 256 examples per update, the standard compromise in deep learning).
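The three variants differ only in how many examples feed each gradient estimate. A minimal NumPy sketch on a synthetic linear regression problem, with mini-batch shown and the other two noted in comments (the dataset, batch size, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # 1000 examples, 5 features
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)    # noisy linear targets

def grad(w, Xb, yb):
    # Gradient of the MSE loss for the linear model y_hat = Xb @ w
    return (2 / len(Xb)) * Xb.T @ (Xb @ w - yb)

w, lr = np.zeros(5), 0.05
for epoch in range(100):
    # Batch GD would be one update per epoch: w -= lr * grad(w, X, y)
    # Stochastic GD is the batch-size-1 special case of the loop below
    idx = rng.permutation(len(X))               # mini-batch GD: shuffle ...
    for start in range(0, len(X), 32):
        b = idx[start:start + 32]               # ... then update per batch of 32
        w -= lr * grad(w, X[b], y[b])

print(np.round(w, 2))  # close to w_true
```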
Advanced Optimisers — Beyond Plain Gradient Descent
Vanilla gradient descent treats all parameters equally and uses the same learning rate for every weight. Modern optimisers improve on this in two ways: by incorporating momentum (using past gradients to build up speed and dampen oscillation) and by using adaptive learning rates (giving each parameter its own effective learning rate based on its history). The result is dramatically faster convergence and far less sensitivity to the initial learning rate choice.
```python
import numpy as np
import torch, torch.nn as nn

# ─── SGD with Momentum — implemented from scratch in NumPy ────────
class SGDMomentum:
    def __init__(self, lr=0.01, beta=0.9):
        self.lr, self.beta = lr, beta
        self.velocity = None
    def step(self, params, grads):
        if self.velocity is None:
            self.velocity = np.zeros_like(params)
        self.velocity = self.beta * self.velocity - self.lr * grads
        return params + self.velocity

# ─── Adam — implemented from scratch in NumPy ─────────────────────
class Adam:
    def __init__(self, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = self.v = self.t = 0
    def step(self, params, grads):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grads     # 1st moment
        self.v = self.b2 * self.v + (1 - self.b2) * grads**2  # 2nd moment
        m_hat = self.m / (1 - self.b1**self.t)                # bias corrected
        v_hat = self.v / (1 - self.b2**self.t)                # bias corrected
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# ─── Using optimisers in PyTorch (production style) ───────────────
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                      nn.Linear(64, 1), nn.Sigmoid())
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# ─── Standard PyTorch training step (runs the same for any optimiser)
optimizer = opt_adamw  # swap to any optimiser above
criterion = nn.BCELoss()

for epoch in range(100):
    X_batch = torch.randn(32, 10)               # simulated batch
    y_batch = torch.randint(0, 2, (32, 1)).float()
    optimizer.zero_grad()                       # 1. clear accumulated gradients
    loss = criterion(model(X_batch), y_batch)   # 2. forward pass + loss
    loss.backward()                             # 3. backprop — compute all gradients
    optimizer.step()                            # 4. update all parameters
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d} | Loss: {loss.item():.4f}")
```
The Jacobian and Hessian — Higher-Order Derivatives
The gradient describes how a scalar loss changes with respect to a vector of parameters. When the function itself is vector-valued — that is, when we have multiple outputs — we need the Jacobian matrix. And when we want to understand not just the slope but the curvature of the loss surface, we need the Hessian matrix of second derivatives.
```python
import torch
from torch.autograd.functional import jacobian, hessian

# ── Define a vector-valued function for Jacobian demo
def f_vec(x):
    # Input: x ∈ ℝ³ | Output: [x₀², x₁*x₂, x₀+x₁+x₂] ∈ ℝ³
    return torch.stack([x[0]**2, x[1]*x[2], x.sum()])

x0 = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f_vec, x0)  # (3,3) matrix — df_i/dx_j
print("Jacobian matrix:\n", J)
# → [[2, 0, 0],   ← d(x₀²)/dx_j
#    [0, 3, 2],   ← d(x₁x₂)/dx_j
#    [1, 1, 1]]   ← d(x₀+x₁+x₂)/dx_j

# ── Define a scalar loss for Hessian demo
def loss_fn(w):
    # Simple quadratic: L(w) = w₀² + 2w₁² + w₀w₁ + 3w₂²
    return w[0]**2 + 2*w[1]**2 + w[0]*w[1] + 3*w[2]**2

w0 = torch.tensor([1.0, 1.0, 1.0])
H = hessian(loss_fn, w0)  # (3,3) symmetric matrix of 2nd derivatives
print("Hessian matrix:\n", H)
# → [[2, 1, 0],   ← d²L/dw₀² and d²L/dw₀dw₁
#    [1, 4, 0],   ← d²L/dw₁dw₀ and d²L/dw₁²
#    [0, 0, 6]]   ← d²L/dw₂²

# ── Eigenvalues of the Hessian reveal the loss surface shape
eigvals = torch.linalg.eigvalsh(H)
print("Hessian eigenvalues:", eigvals)  # all positive → strictly convex loss
print("Condition number:", (eigvals.max()/eigvals.min()).item())
# A very large condition number → the loss surface is elongated like a ravine
# → gradient descent will oscillate, momentum or Adam will converge faster
```
The condition number explains slow convergence. If the Hessian's largest eigenvalue is 1000 and the smallest is 1, the condition number is 1000. This means the loss surface looks like a very thin, elongated valley. Gradient descent bounces back and forth across the narrow dimension while making slow progress along the long one. Adaptive optimisers like Adam normalise each parameter's step by its gradient history, effectively preconditioning the update to handle this imbalance — which is a primary reason Adam outperforms plain SGD on many non-convex deep learning problems.
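The effect is easy to reproduce on a two-dimensional quadratic whose Hessian has eigenvalues 1 and 100 (the eigenvalues and learning rate here are illustrative): the stable step size is capped by the steep direction, so the flat direction crawls.

```python
import numpy as np

# Quadratic loss L(w) = 0.5 * w^T H w with curvatures (eigenvalues) 1 and 100
H = np.diag([1.0, 100.0])
lr = 0.009   # must stay below 2/100 for stability along the steep direction

w = np.array([1.0, 1.0])
for _ in range(100):
    w = w - lr * (H @ w)   # gradient of the quadratic is H @ w

print(w)
# The steep coordinate (curvature 100) is essentially zero after 100 steps,
# while the flat coordinate (curvature 1) has only shrunk to roughly 0.40:
# progress along flat directions is about condition-number times slower
```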
Learning Rate Scheduling and Warm-Up
The learning rate is the single most impactful hyperparameter in gradient-based optimisation. A constant learning rate is rarely optimal: early in training, a large rate accelerates convergence; later, a smaller rate prevents overshooting the minimum. Learning rate schedules systematically reduce the learning rate over training, and warm-up strategies gradually increase it at the very start.
```python
import math
import torch, torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# ── 1. StepLR: multiply LR by gamma every step_size epochs
step_sched = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# ── 2. CosineAnnealingLR: smoothly decays to eta_min over T_max epochs
cos_sched = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)  # default for vision models

# ── 3. ReduceLROnPlateau: decay when validation loss stops improving
plateau_sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=10)  # good for tabular ML

# ── 4. Linear warm-up + cosine decay (standard for Transformers)
def warmup_cosine_schedule(step, warmup_steps=1000, total_steps=10000):
    if step < warmup_steps:
        return step / warmup_steps  # linear ramp-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

lambda_sched = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=warmup_cosine_schedule)

# ── Training loop with scheduler
scheduler = cos_sched  # choose one scheduler
for epoch in range(100):
    # ... forward + backward + optimizer.step() ...
    scheduler.step()  # update LR after each epoch
    if epoch % 20 == 0:
        current_lr = optimizer.param_groups[0]['lr']
        print(f"Epoch {epoch:3d} | LR: {current_lr:.2e}")
```
Practical guidelines for learning rate selection: Start with the learning rate finder (increase LR exponentially and plot loss — the ideal LR is just before the loss starts exploding). For Adam/AdamW, 1e-3 is a reliable default. For SGD with momentum, 0.01 to 0.1 is typical. Always use a scheduler — cosine annealing for deep learning, ReduceLROnPlateau for tabular problems where validation loss is your guide.
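The learning rate finder idea can be sketched on a toy quadratic: ramp the LR exponentially and watch where the loss turns from decreasing to exploding (the curvature and ramp factor here are illustrative, not a real training setup):

```python
# Toy LR range test on the quadratic L(w) = 0.5 * c * w**2 with curvature c = 10
curvature = 10.0
w, lr = 1.0, 1e-4
history = []                          # (lr, loss) pairs
for step in range(60):
    w = w - lr * (curvature * w)      # one GD step at the current LR
    history.append((lr, 0.5 * curvature * w**2))
    lr *= 1.3                         # exponential LR ramp

# Loss falls while lr < 2/curvature = 0.2, then blows up past that threshold
for lr_i, loss_i in history[::10]:
    print(f"lr={lr_i:.2e}  loss={loss_i:.3e}")
```

On a real model the same procedure is run for a few hundred mini-batches, and the chosen LR is a bit below the point where the loss curve turns upward.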
Gradient Pathologies — Vanishing and Exploding Gradients
Two notorious problems arise from the multiplicative nature of the chain rule in deep networks. When many small numbers are multiplied together, the product approaches zero exponentially fast — this is the vanishing gradient problem. When many numbers larger than 1 are multiplied, the product explodes — this is the exploding gradient problem. Both prevent effective training of deep networks.
| Pathology | Cause | Symptom | Solution |
|---|---|---|---|
| Vanishing Gradient | Sigmoid/tanh derivatives ≤ 0.25 multiplied across many layers | Early layers receive near-zero gradients — weights stop updating — network fails to learn | ReLU activations, residual connections (ResNet), batch normalisation, better weight initialisation (He, Xavier) |
| Exploding Gradient | Weights initialised too large, very deep RNN unrolled over long sequences | NaN loss, wildly oscillating training curves, parameters become infinite | Gradient clipping (clip by norm or value), careful weight initialisation, layer normalisation |
| Dead ReLU | Neurons stuck at z < 0 — ReLU outputs 0 and gradient = 0 permanently | A fraction of neurons never activate — network capacity is wasted | Leaky ReLU, ELU, GELU activations. Lower learning rate. Careful initialisation. |
```python
import torch, torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=128, num_layers=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

def train_step(X, y, max_grad_norm=1.0):
    optimizer.zero_grad()
    out, _ = model(X)
    loss = criterion(out, y)
    loss.backward()

    # ── Monitor gradient norm before clipping
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_norm += p.grad.norm().item() ** 2
    total_norm = total_norm ** 0.5

    # ── Clip by global L2 norm — rescales all gradients so their total norm ≤ max_grad_norm
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)

    optimizer.step()
    return loss.item(), total_norm

# ── Weight initialisation strategies affect gradient flow from step 1
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')  # He init for ReLU
        nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Embedding):
        nn.init.normal_(m.weight, mean=0, std=0.01)  # small init for embeddings

model.apply(init_weights)  # apply initialisation to all matching layers
```
Quick Reference — Calculus Concepts in ML
| Concept | Mathematical Object | Role in ML | Code |
|---|---|---|---|
| Derivative | df/dx — scalar | Rate of change of loss w.r.t. a single parameter | loss.backward() |
| Gradient | ∇L — vector | Direction of steepest ascent in parameter space; negate for descent | p.grad |
| Chain Rule | dL/dx = dL/du · du/dx | Propagates gradient backward through composed layers — backpropagation | Autograd builds computation graph |
| Jacobian | J ∈ ℝᵐˣⁿ — matrix | Gradient of vector-valued layer outputs w.r.t. inputs | torch.autograd.functional.jacobian |
| Hessian | H ∈ ℝⁿˣⁿ — matrix | Curvature of loss surface; informs second-order optimisers | torch.autograd.functional.hessian |
| Learning Rate | α — scalar | Step size along the negative gradient direction | optimizer = Adam(params, lr=1e-3) |
| Gradient Clipping | g ← g · c/‖g‖ if ‖g‖ > c | Prevents exploding gradients in RNNs and deep networks | clip_grad_norm_(params, 1.0) |
| Momentum | v ← βv − αg | Builds velocity in consistent gradient directions; dampens oscillation | SGD(params, momentum=0.9) |
| Adaptive LR (Adam) | θ ← θ − αm̂/√(v̂+ε) | Per-parameter learning rates based on first and second gradient moments | Adam(params, lr=1e-3) |
Key Takeaways
- The derivative measures instantaneous rate of change. In ML, it tells us how the loss changes when we nudge a single parameter by an infinitesimal amount.
- The gradient is a vector of all partial derivatives — one per parameter. It points in the direction of steepest increase in the loss, so gradient descent moves in the opposite direction.
- The chain rule is the mathematical engine of backpropagation: it decomposes the gradient of the overall loss into a product of local derivatives at each layer, enabling efficient computation from output back to input.
- When sigmoid and binary cross-entropy are combined, the gradient simplifies beautifully to ŷ − y — a design principle that generalises to softmax with categorical cross-entropy.
- The Jacobian generalises the gradient to vector-valued functions; the Hessian captures curvature. A large condition number of the Hessian explains why plain GD converges slowly on elongated loss surfaces.
- Modern optimisers (Adam, AdamW) combine momentum with adaptive per-parameter learning rates, making them robust to different gradient scales across layers — a critical advantage in deep networks.
- Vanishing gradients (small activation derivatives stacked multiplicatively) and exploding gradients (large weight-gradient products) are the primary training instability challenges. ReLU, residual connections, normalisation layers, gradient clipping, and careful initialisation are the standard mitigations.
- Always use a learning rate scheduler — cosine annealing for deep learning, ReduceLROnPlateau for tabular tasks. The learning rate is the single most impactful hyperparameter to tune.
What’s Next?
In Chapter 2.3 — Probability Theory and Distributions, we build the third pillar of ML mathematics. Where linear algebra provides the data structures and calculus provides the optimisation mechanism, probability theory provides the language for uncertainty, likelihood, and statistical inference. We will cover random variables, key distributions (Gaussian, Bernoulli, Multinomial, Poisson), expectation, variance, and the foundational principles of maximum likelihood estimation — the statistical framework behind almost every loss function in machine learning.