Calculus for Optimization — Derivatives and Gradients

If linear algebra is the language of ML data, calculus is the engine that drives learning. Every time a neural network adjusts its weights, every time a logistic regression converges, and every time a loss function is minimised, the mechanism is rooted in one idea: the derivative. Understanding derivatives, gradients, the chain rule, and second-order quantities is what transforms the abstract notion of “the model learns” into a precise, computable process.

Why Calculus is the Engine of Machine Learning

Machine learning is, at its mathematical core, an optimisation problem. Given a model with learnable parameters θ and a dataset, we want to find the specific values of θ that minimise a loss function L(θ). The loss function measures how wrong the model’s predictions are. The question calculus answers is: in which direction should I move the parameters to make the loss smaller?

The derivative of the loss with respect to each parameter gives exactly that direction. The gradient — a vector of all partial derivatives — tells us the direction of steepest ascent in the loss landscape. Moving in the opposite direction (gradient descent) reduces the loss. This is the fundamental mechanism behind every ML optimiser from vanilla SGD to Adam.

df/dx
Derivative
The instantaneous rate of change of a scalar function with respect to a single scalar variable. Tells you the slope of the function at a point.
∇f
Gradient
A vector of all partial derivatives of a scalar function with respect to all its inputs. Points in the direction of steepest increase.
J
Jacobian
A matrix of all partial derivatives of a vector-valued function. Generalises the gradient to functions with multiple outputs.
H
Hessian
A matrix of second-order partial derivatives. Describes the curvature of the loss surface, used in second-order optimisers like Newton’s method.

Derivatives — The Slope at a Point

The derivative of a function f(x) at a point x measures the instantaneous rate at which f changes as x changes. Geometrically, it is the slope of the tangent line to the function’s graph at that point. Formally, it is defined as the limit of the difference quotient as the step size approaches zero.

f′(x) = lim_{h→0} [f(x + h) − f(x)] / h = df/dx
The formal limit definition of the derivative — the foundation of all ML gradient computations
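The limit can be watched numerically by shrinking h in the difference quotient. A small sketch for f(x) = x², whose derivative at x = 3 is exactly 6:

```python
# Numerical view of the limit definition: f(x) = x**2, so f'(3) = 6.
def f(x):
    return x ** 2

x = 3.0
for h in [1.0, 0.1, 0.01, 0.001]:
    forward = (f(x + h) - f(x)) / h          # forward difference quotient
    central = (f(x + h) - f(x - h)) / (2*h)  # central difference (more accurate)
    print(f"h={h:>6}: forward={forward:.6f}, central={central:.6f}")
# The forward quotient equals 6 + h here, so it approaches 6 as h shrinks;
# the central difference is exact for a quadratic.
```

The same shrinking-h check (a "gradient check") is a standard way to debug hand-derived gradients against numerical ones.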

In practice, we never compute this limit by hand for complex ML models. Instead, automatic differentiation (autograd) frameworks like PyTorch and TensorFlow compute exact derivatives analytically by applying differentiation rules to the computational graph of the function. However, understanding the differentiation rules is essential for reading research papers, deriving loss functions, and debugging training dynamics.

Core Differentiation Rules
Power Rule
d/dx [xⁿ] = nxⁿ⁻¹
d/dx [x³] = 3x² — appears in polynomial loss surfaces
Constant Rule
d/dx [c] = 0
Constants (fixed, non-learnable terms) contribute zero to the gradient
Sum Rule
d/dx [f + g] = f’ + g’
Gradient of a sum = sum of gradients — used when summing per-sample losses
Product Rule
d/dx [fg] = f’g + fg’
Appears when differentiating attention score computations
Chain Rule
d/dx [f(g(x))] = f'(g(x)) · g'(x)
The engine of backpropagation — every nested function in a network
Exponential
d/dx [eˣ] = eˣ
Softmax, sigmoid, and cross-entropy derivatives all involve this
Natural Log
d/dx [ln x] = 1/x
Cross-entropy term: d/dŷ [−y ln ŷ] = −y/ŷ, which combines with the sigmoid derivative via the chain rule to give the clean gradient ŷ − y
Sigmoid
σ'(x) = σ(x)(1 − σ(x))
Backprop through sigmoid — the output itself is reused in the gradient

Why activation function derivatives matter: The vanishing gradient problem occurs when activation derivatives are near zero across many layers. ReLU avoids this because its derivative is exactly 1 for positive inputs and 0 for negative inputs — it either passes the gradient through unchanged or kills it. Sigmoid has a maximum derivative of 0.25, so stacking many sigmoid layers multiplies 0.25 repeatedly, driving gradients toward zero exponentially fast.
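The multiplicative shrinkage is easy to see numerically. A small sketch assuming every layer contributes the maximum possible sigmoid derivative of 0.25:

```python
# Vanishing gradient sketch: multiply the maximum sigmoid derivative
# (0.25, attained at z = 0) across n stacked layers.
for n_layers in [5, 10, 20]:
    grad_scale = 0.25 ** n_layers
    print(f"{n_layers} sigmoid layers: gradient scaled by at most {grad_scale:.2e}")
# With 20 layers the gradient reaching the first layer is at most ~9.1e-13
# of the output gradient. A ReLU layer's derivative of 1 (for positive
# inputs) leaves the gradient magnitude unchanged.
```

In real networks the actual sigmoid derivatives are usually well below 0.25, so the shrinkage is even faster than this upper bound suggests.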

Derivatives of Key ML Functions

| Function | Formula | Derivative | ML Role |
|---|---|---|---|
| Sigmoid | σ(x) = 1/(1+e⁻ˣ) | σ(x)(1−σ(x)) | Binary classification output, logistic regression |
| Tanh | tanh(x) = (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | 1 − tanh²(x) | LSTM gates, zero-centred alternative to sigmoid |
| ReLU | max(0, x) | 1 if x > 0, else 0 | Default hidden layer activation in deep networks |
| Leaky ReLU | max(αx, x), α small | 1 if x > 0, else α | Avoids dying ReLU by allowing a small negative gradient |
| Softmax | sᵢ = e^zᵢ / ∑ⱼ e^zⱼ | sᵢ(δᵢⱼ − sⱼ) | Multi-class output layer — gradient simplifies beautifully with cross-entropy |
| MSE Loss | (1/n) ∑(yᵢ − ŷᵢ)² | (2/n) ∑(ŷᵢ − yᵢ) | Regression — gradient is proportional to the residual |
| Cross-Entropy | −∑ yᵢ log(ŷᵢ) | ŷ − y (after softmax) | Classification — extremely clean gradient, fast convergence |
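Rows of the table can be spot-checked with a central finite difference; the test point 0.7 here is arbitrary:

```python
import numpy as np

# Spot-check the sigmoid and tanh rows with a central difference quotient.
def num_deriv(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

sigmoid = lambda x: 1 / (1 + np.exp(-x))

x = 0.7
s = sigmoid(x)
print(np.isclose(num_deriv(sigmoid, x), s * (1 - s)))     # σ' = σ(1−σ) → True
print(np.isclose(num_deriv(np.tanh, x), 1 - np.tanh(x)**2))  # tanh' = 1−tanh² → True
```

The same pattern works for any scalar activation, and it is the standard sanity check when deriving new activation gradients by hand.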

Partial Derivatives — Sensitivity to Each Input

Real ML models have not one but thousands or millions of parameters. A loss function that depends on all of them is a function of many variables. To optimise such a function, we need to know how the loss changes with respect to each parameter independently, while holding all others fixed. This is the partial derivative.

∂f/∂xᵢ = lim_{h→0} [f(…, xᵢ + h, …) − f(…, xᵢ, …)] / h
Partial derivative with respect to x_i — all other variables held constant

As a concrete example, consider the mean squared error loss for linear regression with weights w0 and w1:

L(w0, w1) = (1/n) ∑ᵢ (yᵢ − (w0 + w1xᵢ))²
MSE loss for simple linear regression — a function of two parameters w_0 and w_1
Partial w.r.t. w0 (intercept)
∂L/∂w0 = −(2/n) ∑(yi − ŷi)
The gradient with respect to the bias is the average residual. If this is positive, we overpredict on average and must decrease w0.
Partial w.r.t. w1 (slope)
∂L/∂w1 = −(2/n) ∑xi(yi − ŷi)
The gradient with respect to the slope is the residual weighted by the input. Features with large values have proportionally larger influence on this gradient.
The Gradient Vector
∇wL = [∂L/∂w0, ∂L/∂w1]ᵀ
The gradient stacks all partial derivatives into one vector pointing in the direction of steepest loss increase. We move in the opposite direction.
Gradient Descent Step
w ← w − α ∇wL
Subtract a fraction α (learning rate) of the gradient from the current weights. Repeat until the gradient is approximately zero — the minimum.
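These two partials are everything needed to fit the regression by hand. A minimal NumPy sketch, assuming hypothetical synthetic data with true intercept 3 and true slope 2:

```python
import numpy as np

# Gradient descent on the MSE loss above, using the two partial
# derivatives derived for w0 (intercept) and w1 (slope).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=100)   # true w0=3, w1=2

w0, w1, alpha = 0.0, 0.0, 0.01
for _ in range(5000):
    resid = y - (w0 + w1 * x)                  # residuals y_i - y_hat_i
    dw0 = -(2 / len(x)) * resid.sum()          # ∂L/∂w0
    dw1 = -(2 / len(x)) * (x * resid).sum()    # ∂L/∂w1
    w0 -= alpha * dw0                          # gradient descent step
    w1 -= alpha * dw1

print(f"w0 ≈ {w0:.2f}, w1 ≈ {w1:.2f}")   # close to the true 3 and 2
```

With clean data and a stable learning rate, both parameters converge to within the noise level of their true values.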

The Chain Rule — The Heart of Backpropagation

The chain rule is the single most important result from calculus for deep learning. It tells us how to differentiate a composition of functions — and a neural network is precisely a deeply nested composition of functions. Every layer transforms its input with weights and an activation, and the chain rule lets us propagate the gradient of the final loss all the way back through every layer to every weight.

The chain rule states: if y = f(u) and u = g(x), then the derivative of y with respect to x is the product of the individual derivatives.

dy/dx = dy/du · du/dx     Multivariate: ∂L/∂x = ∑ₖ (∂L/∂uₖ) · (∂uₖ/∂x)
Chain rule — scalar and multivariate forms. In a neural network, the gradient flows backward through each layer by multiplying local Jacobians.
Backpropagation — Chain Rule Through a 3-Layer Network
Input X  →  Z¹ = XW¹  →  A¹ = ReLU(Z¹)  →  Z² = A¹W²  →  ŷ = σ(Z²)  →  L = BCE(y, ŷ)

∂L/∂W¹  =  ∂L/∂ŷ  ·  ∂ŷ/∂Z²  ·  ∂Z²/∂A¹  ·  ∂A¹/∂Z¹  ·  ∂Z¹/∂W¹

Each ∂/∂ term is a local derivative — computable analytically at each node. Backpropagation multiplies them in reverse order (right to left), accumulating the gradient all the way back to the first layer’s weights.

A Concrete Backpropagation Example

Consider a minimal network: one linear layer followed by sigmoid, with binary cross-entropy loss. We compute the forward pass, then derive every gradient symbolically, then verify with PyTorch autograd.

Python — Backprop by Hand vs PyTorch Autograd
```python
import numpy as np
import torch

# ═══ Manual backprop: single sample, single neuron ═══════════════
# Network: z = w*x + b,  y_hat = sigmoid(z),  L = -y*log(y_hat) - (1-y)*log(1-y_hat)

x, y, w, b = 2.0, 1.0, 0.5, -0.1

# ── Forward pass
z     = w * x + b                       # z = 0.9
y_hat = 1 / (1 + np.exp(-z))            # y_hat = sigmoid(0.9) ≈ 0.7109
L     = -(y * np.log(y_hat) + (1-y) * np.log(1-y_hat))  # BCE loss ≈ 0.3412

# ── Manual backward pass using chain rule
# dL/dy_hat = -y/y_hat + (1-y)/(1-y_hat)
dL_dyhat = -y / y_hat + (1-y) / (1-y_hat)

# dy_hat/dz = sigmoid(z) * (1 - sigmoid(z)) = y_hat * (1 - y_hat)
dyhat_dz = y_hat * (1 - y_hat)

# Chain rule: dL/dz = dL/dy_hat * dy_hat/dz = y_hat - y  (beautiful simplification!)
dL_dz = dL_dyhat * dyhat_dz             # = y_hat - y

# dz/dw = x,  dz/db = 1
dL_dw = dL_dz * x                       # = (y_hat - y) * x
dL_db = dL_dz * 1                       # = (y_hat - y)

print(f"Manual   dL/dw = {dL_dw:.6f},  dL/db = {dL_db:.6f}")

# ═══ PyTorch autograd — verify our manual gradients ══════════════
w_t = torch.tensor(0.5,  requires_grad=True)
b_t = torch.tensor(-0.1, requires_grad=True)
x_t = torch.tensor(2.0)
y_t = torch.tensor(1.0)

z_t    = w_t * x_t + b_t
yhat_t = torch.sigmoid(z_t)
L_t    = torch.nn.functional.binary_cross_entropy(yhat_t, y_t)
L_t.backward()                          # compute all gradients automatically

print(f"Autograd dL/dw = {w_t.grad.item():.6f},  dL/db = {b_t.grad.item():.6f}")
# Both outputs match — the manual chain-rule derivation agrees with autograd
```

The beautiful simplification: When sigmoid is combined with binary cross-entropy loss, the chain rule produces a strikingly clean gradient: ∂L/∂z = ŷ − y — simply the difference between the predicted probability and the true label. This is not a coincidence. The cross-entropy loss was specifically designed as the natural loss for the sigmoid, because this pairing produces the cleanest gradient and the fastest convergence. The same pattern holds for softmax combined with categorical cross-entropy.
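The simplification falls out in two lines of algebra, combining the BCE derivative with the sigmoid derivative from the table above:

```latex
\frac{\partial L}{\partial z}
  = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
  = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}\,(1-\hat{y})
  = -y(1-\hat{y}) + (1-y)\,\hat{y}
  = \hat{y} - y
```

The ŷ and 1 − ŷ factors from the sigmoid derivative exactly cancel the denominators coming from the logarithms in the loss.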


Gradient Descent — Navigating the Loss Landscape

With the gradient computed, the update rule is straightforward: move the parameters in the direction opposite to the gradient by a small step proportional to the learning rate α. This is gradient descent, and it is the foundation of every ML optimisation algorithm.

θt+1 = θt − α · ∇θL(θt)
The gradient descent parameter update. α is the learning rate (step size). ∇L is the gradient vector of the loss.
Gradient Descent on a 2D Loss Surface
(Figure: contour plot of a 2D loss surface over weights w₁ and w₂, showing the gradient descent path from the starting point θ₀ through high- and medium-loss regions to the global minimum, with the gradient ∇L drawn at each step.)

Each step moves parameters against the gradient direction. Steps get smaller as the gradient flattens near the minimum. The learning rate α controls how far we move at each step.

The Three Variants of Gradient Descent

The vanilla gradient descent formulation uses the gradient computed over all training examples. In practice, three variants exist that trade off between gradient quality and computational efficiency.

Gradient Descent Variants — Comparison
Batch GD
θ ← θ − α (1/n) ∑i=1n ∇L(xi, θ)
Uses all n samples per update. Exact gradient, but prohibitively slow for large datasets. Each epoch = 1 update.
SGD
θ ← θ − α ∇L(xi, θ)
Uses 1 random sample per update. Very noisy gradient but very fast. Noise can help escape local minima.
Mini-batch GD
θ ← θ − α (1/B) ∑i∈B ∇L(xi, θ)
Standard in practice. Batch size B (32–512). Balances gradient accuracy with speed. Parallelises efficiently on GPU.
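The three variants can be written as one update function whose only difference is how many samples feed each gradient estimate. A sketch with hypothetical toy data (the names `gd_step` and `grad_fn` are illustrative):

```python
import numpy as np

# grad_fn(theta, idx) returns the average gradient over the samples in idx.
def gd_step(theta, grad_fn, n, alpha, batch_size=None, rng=None):
    rng = rng or np.random.default_rng()
    if batch_size is None:
        idx = np.arange(n)                    # Batch GD: all n samples
    else:                                     # SGD: batch_size=1, mini-batch: 32-512
        idx = rng.choice(n, size=batch_size, replace=False)
    return theta - alpha * grad_fn(theta, idx)

# Toy problem: fit y = theta * x with MSE, true theta = 4
x = np.linspace(0, 1, 200)
y = 4.0 * x
grad_fn = lambda th, idx: np.mean(2 * (th * x[idx] - y[idx]) * x[idx])

theta = 0.0
for _ in range(2000):                         # mini-batch GD with B = 32
    theta = gd_step(theta, grad_fn, n=len(x), alpha=0.5, batch_size=32)
print(f"theta ≈ {theta:.2f}")   # converges near the true value 4
```

Setting `batch_size=None` gives batch GD and `batch_size=1` gives SGD; only the sampling line changes.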

Advanced Optimisers — Beyond Plain Gradient Descent

Vanilla gradient descent treats all parameters equally and uses the same learning rate for every weight. Modern optimisers improve on this in two ways: by incorporating momentum (using past gradients to build up speed and dampen oscillation) and by using adaptive learning rates (giving each parameter its own effective learning rate based on its history). The result is dramatically faster convergence and far less sensitivity to the initial learning rate choice.

Classic
SGD with Momentum
v ← βv − α∇L    θ ← θ + v
Accelerates in consistent directions. Dampens oscillations in ravines.
Still requires careful learning rate tuning. β typically 0.9.
Classic
Nesterov (NAG)
v ← βv − α∇L(θ+βv)   θ ← θ + v
Looks ahead before computing gradient. Corrects overshoot before it happens.
More complex implementation but consistently outperforms vanilla momentum.
Classic
AdaGrad
G ← G + g²   θ ← θ − αg/√(G+ε)
Per-parameter adaptive LR. Excellent for sparse features (NLP).
Learning rate monotonically decreases — may stop learning too early in long training.
Classic
RMSProp
G ← βG + (1−β)g²   θ ← θ − αg/√(G+ε)
Fixes AdaGrad’s decaying LR. Exponential moving average of squared gradients.
No bias correction. β typically 0.9–0.99.
Industry Standard
Adam
m ← β₁m+(1−β₁)g   v ← β₂v+(1−β₂)g²   θ ← θ − αm̂/(√v̂+ε)
Momentum + adaptive LR + bias correction. Works well out-of-the-box.
Can generalise worse than SGD+momentum in some settings (convex problems).
Modern
AdamW
Adam update + θ ← θ − ηθ (decoupled weight decay)
Fixes Adam’s weight decay. Default choice for Transformer models (BERT, GPT).
Requires tuning weight decay η separately from learning rate.
Python — Optimisers from Scratch and in PyTorch
```python
import numpy as np
import torch, torch.nn as nn

# ─── SGD with Momentum — implemented from scratch in NumPy ────────
class SGDMomentum:
    def __init__(self, lr=0.01, beta=0.9):
        self.lr, self.beta = lr, beta
        self.velocity = None
    def step(self, params, grads):
        if self.velocity is None:
            self.velocity = np.zeros_like(params)
        self.velocity = self.beta * self.velocity - self.lr * grads
        return params + self.velocity

# ─── Adam — implemented from scratch in NumPy ─────────────────────
class Adam:
    def __init__(self, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = self.v = self.t = 0
    def step(self, params, grads):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grads      # 1st moment
        self.v = self.b2 * self.v + (1 - self.b2) * grads**2   # 2nd moment
        m_hat = self.m / (1 - self.b1**self.t)                 # bias corrected
        v_hat = self.v / (1 - self.b2**self.t)                 # bias corrected
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# ─── Using optimisers in PyTorch (production style) ───────────────
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                      nn.Linear(64, 1),  nn.Sigmoid())
opt_sgd   = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
opt_adam  = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# ─── Standard PyTorch training step (runs the same for any optimiser)
optimizer = opt_adamw                   # swap to any optimiser above
criterion = nn.BCELoss()

for epoch in range(100):
    X_batch = torch.randn(32, 10)                  # simulated batch
    y_batch = torch.randint(0, 2, (32, 1)).float()
    optimizer.zero_grad()                          # 1. clear accumulated gradients
    loss = criterion(model(X_batch), y_batch)      # 2. forward pass + loss
    loss.backward()                                # 3. backprop — compute all gradients
    optimizer.step()                               # 4. update all parameters
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d} | Loss: {loss.item():.4f}")
```

The Jacobian and Hessian — Higher-Order Derivatives

The gradient describes how a scalar loss changes with respect to a vector of parameters. When the function itself is vector-valued — that is, when we have multiple outputs — we need the Jacobian matrix. And when we want to understand not just the slope but the curvature of the loss surface, we need the Hessian matrix of second derivatives.

Jf = ∂f/∂x = [∂fᵢ/∂xⱼ]     H = [∂²L/∂xᵢ∂xⱼ]
Jacobian: (m × n) matrix for a function ℝⁿ → ℝᵐ. Hessian: (n × n) symmetric matrix of second derivatives of a scalar function ℝⁿ → ℝ.
J
Jacobian in Practice
The Jacobian of the output with respect to the input is computed during the backward pass of every layer. In backpropagation, multiplying Jacobians in reverse order is what propagates the gradient through the network. PyTorch’s torch.autograd.functional.jacobian computes this explicitly.
H
Hessian and Curvature
The Hessian reveals the shape of the loss surface. A positive definite Hessian means the loss is convex at that point — a unique minimum exists. Newton’s method uses the Hessian to take curved steps that converge in far fewer iterations. Full Hessian computation costs O(n²) memory — infeasible for large networks, so approximations (L-BFGS, K-FAC) are used.
∇²
Newton’s Method
Uses curvature: θ ← θ − H−1∇L. Takes a scaled step that accounts for the curvature at each point. Converges in far fewer steps than gradient descent. Too expensive for deep networks but used in small-scale convex ML problems.
  ∞
Saddle Points
In high-dimensional loss surfaces (millions of parameters), flat regions and saddle points — where the gradient is zero but it is not a minimum — are far more common than local minima. The Hessian’s mixed positive and negative eigenvalues identify saddle points, explaining why momentum-based optimisers escape them faster than plain GD.
Python — Jacobian and Hessian with PyTorch Autograd
```python
import torch
from torch.autograd.functional import jacobian, hessian

# ── Define a vector-valued function for Jacobian demo
def f_vec(x):
    # Input: x ∈ ℝ³  |  Output: [x₀², x₁*x₂, x₀+x₁+x₂] ∈ ℝ³
    return torch.stack([x[0]**2, x[1]*x[2], x.sum()])

x0 = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f_vec, x0)             # (3,3) matrix — df_i/dx_j
print("Jacobian matrix:\n", J)
# → [[2, 0, 0],   ← d(x₀²)/dx_j
#    [0, 3, 2],   ← d(x₁x₂)/dx_j
#    [1, 1, 1]]   ← d(x₀+x₁+x₂)/dx_j

# ── Define a scalar loss for Hessian demo
def loss_fn(w):
    # Simple quadratic: L(w) = w₀² + 2w₁² + w₀w₁ + 3w₂²
    return w[0]**2 + 2*w[1]**2 + w[0]*w[1] + 3*w[2]**2

w0 = torch.tensor([1.0, 1.0, 1.0])
H = hessian(loss_fn, w0)            # (3,3) symmetric matrix of 2nd derivatives
print("Hessian matrix:\n", H)
# → [[2, 1, 0],   ← d²L/dw₀² and d²L/dw₀dw₁
#    [1, 4, 0],   ← d²L/dw₁dw₀ and d²L/dw₁²
#    [0, 0, 6]]   ← d²L/dw₂²

# ── Eigenvalues of the Hessian reveal the loss surface shape
eigvals = torch.linalg.eigvalsh(H)
print("Hessian eigenvalues:", eigvals)     # all positive → strictly convex loss
print("Condition number:", (eigvals.max()/eigvals.min()).item())
# A very large condition number → the loss surface is elongated like a ravine
# → gradient descent will oscillate, momentum or Adam will converge faster
```

The condition number explains slow convergence. If the Hessian’s largest eigenvalue is 1000 and the smallest is 1, the condition number is 1000. This means the loss surface looks like a very thin, elongated valley. Gradient descent bounces back and forth across the narrow dimension while making slow progress along the long one. Adaptive optimisers like Adam normalise each parameter’s step by its gradient history, effectively preconditioning the update to handle this imbalance — which is the primary reason Adam outperforms plain SGD on non-convex deep learning problems.
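This behaviour is easy to reproduce on a toy quadratic whose Hessian has eigenvalues 1 and 100 (condition number 100):

```python
import numpy as np

# GD on an ill-conditioned quadratic L(w) = 0.5 * w @ H @ w.
H = np.diag([1.0, 100.0])            # eigenvalues 1 (flat) and 100 (steep)
grad = lambda w: H @ w               # ∇L = Hw

# The stable step size is limited by the LARGEST eigenvalue (alpha < 2/100),
# but progress along the flat direction is governed by the SMALLEST.
w = np.array([1.0, 1.0])
alpha = 0.015
for _ in range(100):
    w = w - alpha * grad(w)

print(w)
# The steep coordinate w[1] collapses to ~0 almost immediately, while the
# flat coordinate w[0] has only decayed to about 0.985**100 ≈ 0.22.
```

The learning rate cannot be raised to speed up the flat direction without making the steep direction diverge; that tension is exactly what preconditioning removes.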


Learning Rate Scheduling and Warm-Up

The learning rate is the single most impactful hyperparameter in gradient-based optimisation. A constant learning rate is rarely optimal: early in training, a large rate accelerates convergence; later, a smaller rate prevents overshooting the minimum. Learning rate schedules systematically reduce the learning rate over training, and warm-up strategies gradually increase it at the very start.

Python — Learning Rate Schedulers in PyTorch
```python
import math
import torch, torch.nn as nn

model     = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# ── 1. StepLR: multiply LR by gamma every step_size epochs
step_sched = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# ── 2. CosineAnnealingLR: smoothly decays to eta_min over T_max epochs
cos_sched = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)              # default for vision models

# ── 3. ReduceLROnPlateau: decay when validation loss stops improving
plateau_sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=10)  # good for tabular ML

# ── 4. Linear warm-up + cosine decay (standard for Transformers)
def warmup_cosine_schedule(step, warmup_steps=1000, total_steps=10000):
    if step < warmup_steps:
        return step / warmup_steps                   # linear ramp-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

lambda_sched = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=warmup_cosine_schedule)

# ── Training loop with scheduler
scheduler = cos_sched                   # choose one scheduler
for epoch in range(100):
    # ... forward + backward + optimizer.step() ...
    scheduler.step()                    # update LR after each epoch
    if epoch % 20 == 0:
        current_lr = optimizer.param_groups[0]['lr']
        print(f"Epoch {epoch:3d} | LR: {current_lr:.2e}")
```

Practical guidelines for learning rate selection: Start with the learning rate finder (increase LR exponentially and plot loss — the ideal LR is just before the loss starts exploding). For Adam/AdamW, 1e-3 is a reliable default. For SGD with momentum, 0.01 to 0.1 is typical. Always use a scheduler — cosine annealing for deep learning, ReduceLROnPlateau for tabular problems where validation loss is your guide.
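The learning rate finder itself is only a few lines. A sketch on a toy least-squares problem (a real run would use your own model and data loader in place of the synthetic X and y here):

```python
import numpy as np

# LR range test: train while ramping the LR exponentially and record
# the loss at each step; the usable LR ceiling is roughly where the
# recorded losses start to climb steeply.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 5))
true_w = rng.normal(size=5)
y = X @ true_w

w = np.zeros(5)
lr, lrs, losses = 1e-6, [], []
while lr < 10.0:
    grad = 2 / len(X) * X.T @ (X @ w - y)   # MSE gradient
    w = w - lr * grad                       # one update at this LR
    losses.append(float(np.mean((X @ w - y) ** 2)))
    lrs.append(lr)
    lr *= 1.5                               # exponential LR ramp

best = lrs[int(np.argmin(losses))]
print(f"loss bottoms out around lr ≈ {best:.3g}; it explodes past that")
```

In practice you would plot `losses` against `lrs` on a log axis and pick a value slightly below the elbow, rather than trusting the argmin blindly.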


Gradient Pathologies — Vanishing and Exploding Gradients

Two notorious problems arise from the multiplicative nature of the chain rule in deep networks. When many small numbers are multiplied together, the product approaches zero exponentially fast — this is the vanishing gradient problem. When many numbers larger than 1 are multiplied, the product explodes — this is the exploding gradient problem. Both prevent effective training of deep networks.

| Pathology | Cause | Symptom | Solution |
|---|---|---|---|
| Vanishing Gradient | Sigmoid/tanh derivatives ≤ 0.25 multiplied across many layers | Early layers receive near-zero gradients — weights stop updating — network fails to learn | ReLU activations, residual connections (ResNet), batch normalisation, better weight initialisation (He, Xavier) |
| Exploding Gradient | Weights initialised too large; very deep RNN unrolled over long sequences | NaN loss, wildly oscillating training curves, parameters become infinite | Gradient clipping (by norm or value), careful weight initialisation, layer normalisation |
| Dead ReLU | Neurons stuck at z < 0 — ReLU outputs 0 and gradient = 0 permanently | A fraction of neurons never activate — network capacity is wasted | Leaky ReLU, ELU, GELU activations; lower learning rate; careful initialisation |
Python — Gradient Clipping and Monitoring
```python
import torch, torch.nn as nn

model     = nn.LSTM(input_size=32, hidden_size=128, num_layers=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

def train_step(X, y, max_grad_norm=1.0):
    optimizer.zero_grad()
    out, _ = model(X)
    loss = criterion(out, y)
    loss.backward()

    # ── Monitor gradient norm before clipping
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_norm += p.grad.norm().item() ** 2
    total_norm = total_norm ** 0.5

    # ── Clip by global L2 norm — rescales all gradients so their total norm ≤ max_grad_norm
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)

    optimizer.step()
    return loss.item(), total_norm

# ── Weight initialisation strategies affect gradient flow from step 1
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')  # He init for ReLU
        nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Embedding):
        nn.init.normal_(m.weight, mean=0, std=0.01)   # small init for embeddings

model.apply(init_weights)           # apply initialisation to all layers
```

Quick Reference — Calculus Concepts in ML

| Concept | Mathematical Object | Role in ML | Code |
|---|---|---|---|
| Derivative | df/dx — scalar | Rate of change of loss w.r.t. a single parameter | loss.backward() |
| Gradient | ∇L — vector | Direction of steepest ascent in parameter space; negate for descent | p.grad |
| Chain Rule | dL/dx = dL/du · du/dx | Propagates gradient backward through composed layers — backpropagation | Autograd builds the computation graph |
| Jacobian | J ∈ ℝᵐˣⁿ — matrix | Gradient of vector-valued layer outputs w.r.t. inputs | torch.autograd.functional.jacobian |
| Hessian | H ∈ ℝⁿˣⁿ — matrix | Curvature of loss surface; informs second-order optimisers | torch.autograd.functional.hessian |
| Learning Rate | α — scalar | Step size along the negative gradient direction | optimizer = Adam(params, lr=1e-3) |
| Gradient Clipping | g ← g · c/‖g‖ if ‖g‖ > c | Prevents exploding gradients in RNNs and deep networks | clip_grad_norm_(params, 1.0) |
| Momentum | v ← βv − αg | Builds velocity in consistent gradient directions; dampens oscillation | SGD(params, momentum=0.9) |
| Adaptive LR (Adam) | θ ← θ − αm̂/(√v̂+ε) | Per-parameter learning rates from first and second gradient moments | Adam(params, lr=1e-3) |

Key Takeaways

  • The derivative measures instantaneous rate of change. In ML, it tells us how the loss changes when we nudge a single parameter by an infinitesimal amount.
  • The gradient is a vector of all partial derivatives — one per parameter. It points in the direction of steepest increase in the loss, so gradient descent moves in the opposite direction.
  • The chain rule is the mathematical engine of backpropagation: it decomposes the gradient of the overall loss into a product of local derivatives at each layer, enabling efficient computation from output back to input.
  • When sigmoid and binary cross-entropy are combined, the gradient simplifies beautifully to ŷ − y — a design principle that generalises to softmax with categorical cross-entropy.
  • The Jacobian generalises the gradient to vector-valued functions; the Hessian captures curvature. A large condition number of the Hessian explains why plain GD converges slowly on elongated loss surfaces.
  • Modern optimisers (Adam, AdamW) combine momentum with adaptive per-parameter learning rates, making them robust to different gradient scales across layers — a critical advantage in deep networks.
  • Vanishing gradients (small activation derivatives stacked multiplicatively) and exploding gradients (large weight-gradient products) are the primary training instability challenges. ReLU, residual connections, normalisation layers, gradient clipping, and careful initialisation are the standard mitigations.
  • Always use a learning rate scheduler — cosine annealing for deep learning, ReduceLROnPlateau for tabular tasks. The learning rate is the single most impactful hyperparameter to tune.

What’s Next?

In Chapter 2.3 — Probability Theory and Distributions, we build the third pillar of ML mathematics. Where linear algebra provides the data structures and calculus provides the optimisation mechanism, probability theory provides the language for uncertainty, likelihood, and statistical inference. We will cover random variables, key distributions (Gaussian, Bernoulli, Multinomial, Poisson), expectation, variance, and the foundational principles of maximum likelihood estimation — the statistical framework behind almost every loss function in machine learning.