Linear Algebra for Machine Learning
Linear algebra is the language of machine learning. Every dataset is a matrix. Every prediction is a dot product. Every dimensionality reduction, neural network weight update, and similarity computation is, at its core, a linear algebra operation. Understanding these foundations is not optional — it is what separates a practitioner who runs algorithms from one who truly understands what is happening inside them.
Why Linear Algebra is Central to ML
When you call model.fit(X, y), dozens of linear algebra operations execute behind the scenes. When a neural network performs a forward pass, it is computing matrix multiplications. When PCA reduces 500 features to 10, it is performing an eigendecomposition. When you compute cosine similarity between two document embeddings, you are computing a normalised dot product.
The reason GPUs accelerate ML so dramatically is that they are architecturally optimised for one specific class of computation: batched matrix operations. Every hardware and software breakthrough in deep learning — from CUDA to TensorFlow to PyTorch — is fundamentally an innovation in making linear algebra faster.
This chapter builds your intuition and computational fluency with the six core linear algebra topics that appear repeatedly throughout the ML curriculum: scalars, vectors, matrices, and tensors; vector operations; matrix operations; special matrices; eigenvalues and eigenvectors; and Singular Value Decomposition (SVD).
Scalars, Vectors, Matrices, and Tensors
Before any operation can be performed, we must understand the fundamental data structures that linear algebra works with. These four types form a natural hierarchy, each being a generalisation of the previous.
```python
import numpy as np

# ── Scalar: a single Python or NumPy number
alpha = 0.01                          # learning rate — just a float
print("Scalar:", alpha, "| ndim:", np.array(alpha).ndim)

# ── Vector: a 1-D array — one data point or one feature column
v = np.array([2, 7, 5, 1])            # shape (4,)
print("Vector shape:", v.shape)       # (4,)

# ── Matrix: a 2-D array — the entire dataset X
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])             # shape (3, 3)
print("Matrix shape:", A.shape)       # (3, 3)

# ── Tensor: an array with 3 or more dimensions — here a 4-D batch of images
batch = np.zeros((32, 224, 224, 3))   # 32 RGB images, 224x224
print("Tensor shape:", batch.shape)   # (32, 224, 224, 3)
print("Dimensions:", batch.ndim)      # 4

# ── The ML dataset layout: X is always shape (n_samples, n_features)
X = np.random.randn(1000, 20)         # 1000 samples, 20 features
print(f"Dataset X: {X.shape} — rows=samples, cols=features")
```
Vector Operations
Vectors are the atomic unit of ML data — a single data point, a single weight vector, or a single embedding. The operations defined on vectors directly correspond to fundamental ML computations.
Vector Addition and Scalar Multiplication
Vector addition combines two vectors element-wise. Scalar multiplication scales every element uniformly. Together they define how gradient descent updates weights: the new weight vector is the old weight vector w minus the learning rate α times the gradient vector g.
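As a minimal numerical sketch of that update rule (the quadratic loss, starting point, and learning rate below are illustrative choices, not from any specific model):

```python
import numpy as np

# Hypothetical quadratic loss L(w) = ||w - w_star||², whose gradient is 2*(w - w_star)
w_star = np.array([1.0, -2.0, 0.5])   # illustrative optimum
w = np.zeros(3)                        # initial weight vector
alpha = 0.1                            # learning rate α

for _ in range(100):
    g = 2 * (w - w_star)               # gradient vector g
    w = w - alpha * g                  # the update: scalar multiplication + vector subtraction

print(np.round(w, 4))                  # converges towards w_star
```

Every iteration is nothing more than the two vector operations defined above, which is why gradient descent scales to millions of parameters.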
The Dot Product
The dot product is the single most important operation in machine learning. It multiplies corresponding elements of two vectors and sums the results, producing a single scalar. Geometrically, it measures how much two vectors point in the same direction. When the dot product is zero, the vectors are orthogonal (perpendicular). When it is large and positive, they are well-aligned.
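A quick numerical check of this geometric reading (the vectors here are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 3.0])      # perpendicular to a
c = np.array([2.0, 0.1])      # nearly parallel to a

print(a @ b)                  # 0.0 — orthogonal vectors
print(a @ c)                  # 2.0 — large and positive: well-aligned

# The identity a·c = ‖a‖‖c‖cos(θ) connects the algebra to the geometry
theta = np.arccos((a @ c) / (np.linalg.norm(a) * np.linalg.norm(c)))
print(np.degrees(theta))      # a small angle, since a and c are nearly parallel
```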
Vector Norms (Measuring Magnitude)
A norm is a function that assigns a non-negative scalar length to a vector. Different norms measure “size” differently and appear in different ML contexts. The most common are the L1 norm (sum of absolute values) and the L2 norm (Euclidean length).
| Norm | Formula | Geometry | ML Application |
|---|---|---|---|
| L0 "Norm" (not a true norm) | ‖v‖₀ = number of nonzero elements | Sparsity count | Sparsity constraints, compressed sensing |
| L1 Norm (Manhattan) | ‖v‖₁ = ∑ᵢ\|vᵢ\| | City-block distance | Lasso regression regularisation — promotes sparse solutions |
| L2 Norm (Euclidean) | ‖v‖₂ = √(∑ᵢ vᵢ²) | Straight-line distance | Ridge regression, KNN distance, gradient magnitude |
| L∞ Norm (Max) | ‖v‖∞ = maxᵢ \|vᵢ\| | Maximum absolute value | Adversarial ML, gradient clipping |
```python
import numpy as np

a = np.array([2, 3, -1])
b = np.array([4, 1, 2])

# ── Vector addition and scalar multiplication
print("a + b:", a + b)                # [6, 4, 1]
print("2 * a:", 2 * a)                # [4, 6, -2]
print("a - 0.01*b:", a - 0.01 * b)    # gradient descent step

# ── Dot product — three equivalent forms
print("Dot product (np.dot):", np.dot(a, b))     # 9
print("Dot product (@):", a @ b)                 # 9 — preferred syntax
print("Dot product (manual):", np.sum(a * b))    # 9

# ── Vector norms
print("L1 norm (Lasso):", np.linalg.norm(a, ord=1))    # 6.0
print("L2 norm (Ridge):", np.linalg.norm(a, ord=2))    # 3.7417
print("L-inf norm:", np.linalg.norm(a, ord=np.inf))    # 3.0

# ── Cosine similarity — critical for embeddings and NLP
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cos_sim:.4f}")                              # 0.5249
print(f"Angle between vectors: {np.degrees(np.arccos(cos_sim)):.1f}°")  # 58.3°
```
Matrix Operations
Matrices allow us to represent and operate on entire datasets simultaneously. The operations defined on matrices are the building blocks of every ML algorithm.
Matrix Multiplication
Matrix multiplication is the fundamental operation of deep learning. Given a matrix A of shape (m, k) and a matrix B of shape (k, n), their product AB has shape (m, n). Each entry in the result is the dot product of a row from A with a column from B. The inner dimensions must match — this constraint arises directly from the dot product requirement.
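To make the row-by-column structure explicit, here is a naive sketch that computes each entry of AB as a dot product and checks it against NumPy's optimised `@` operator (the matrices are arbitrary examples):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])             # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])              # shape (3, 2)

m, k = A.shape
k2, n = B.shape
assert k == k2                        # inner dimensions must match

C = np.zeros((m, n))
for i in range(m):                    # each row of A...
    for j in range(n):                # ...meets each column of B
        C[i, j] = A[i, :] @ B[:, j]   # entry (i, j) is a single dot product

print(C)                              # [[ 58.  64.] [139. 154.]]
print(np.allclose(C, A @ B))          # True — matches the built-in matmul
```

Never use loops like this in production — `A @ B` dispatches to highly optimised BLAS routines — but writing it once cements where the shape constraint comes from.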
Transpose
The transpose of a matrix A, written Aᵀ, flips the matrix over its main diagonal: element (i, j) becomes element (j, i). A matrix of shape (m, n) becomes shape (n, m). Transpose is used constantly in ML — in the Normal Equation for linear regression, in computing covariance matrices for PCA, and in the backpropagation algorithm.
Matrix Inverse
The inverse of a square matrix A, written A⁻¹, is the matrix such that AA⁻¹ = I (the identity matrix). Not all matrices have inverses — a matrix must be square and non-singular (determinant not equal to zero) to be invertible. The inverse appears directly in the closed-form solution for linear regression.
Practical warning: computing the matrix inverse directly is numerically unstable and computationally expensive for large matrices — O(n³) time complexity. In practice, ML frameworks use numerical solvers such as LU decomposition or QR decomposition to solve the equivalent linear system Ax = b without explicitly forming A⁻¹. In NumPy, always prefer np.linalg.solve(A, b) over np.linalg.inv(A) @ b.
```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])             # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])              # shape (3, 2)

# ── Matrix multiplication: (2,3) @ (3,2) → (2,2)
C = A @ B
print("A @ B =\n", C)                 # [[58, 64], [139, 154]]

# ── Transpose
print("A.T shape:", A.T.shape)        # (3, 2)

# ── Normal Equation for linear regression: w = (XᵀX)⁻¹ Xᵀy
np.random.seed(42)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), np.random.randn(n, p)])   # bias column
y = X @ np.array([1, 2, -1, 3]) + 0.1 * np.random.randn(n)

# Method 1: Direct inverse (less stable)
w_direct = np.linalg.inv(X.T @ X) @ X.T @ y

# Method 2: np.linalg.solve (numerically preferred)
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Method 3: NumPy's built-in least squares (most robust)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("Recovered weights (true: [1, 2, -1, 3]):")
print(np.round(w_lstsq, 3))           # close to [1, 2, -1, 3] up to noise
```
Special Matrices You Must Know
Certain matrix types have special properties that are exploited throughout ML theory and algorithms. Recognising them makes derivations far more readable and computation far more efficient.
```python
import numpy as np
from scipy.sparse import csr_matrix

# ── Identity matrix
I = np.eye(3)
print("Identity:\n", I)

# ── Symmetric check: covariance matrix is always symmetric
X = np.random.randn(50, 4)
C = X.T @ X / 50                      # covariance matrix, shape (4, 4)
print("Is symmetric:", np.allclose(C, C.T))                            # True

# ── Orthogonal check: QR decomposition gives an orthogonal Q
Q, R = np.linalg.qr(np.random.randn(4, 4))
print("Q is orthogonal (QᵀQ = I):", np.allclose(Q.T @ Q, np.eye(4)))   # True

# ── Positive definite: covariance matrix has all positive eigenvalues
eigenvalues, _ = np.linalg.eigh(C)    # eigh for symmetric matrices
print("All eigenvalues positive:", np.all(eigenvalues > 0))            # True

# ── Sparse matrix for memory efficiency
dense = np.eye(10000)                 # 10000x10000 identity ≈ 800 MB dense
sparse = csr_matrix(dense)            # sparse: stores only the 10000 nonzeros
print(f"Sparse stored: {sparse.nnz} elements vs {dense.size} in dense")
```
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors reveal the intrinsic structure of a linear transformation. Given a square matrix A, a non-zero vector v is an eigenvector of A if multiplying A by v only stretches or compresses the vector — it does not rotate it. The amount of stretching is given by the corresponding eigenvalue λ.
PCA is eigendecomposition of the covariance matrix. When you run PCA, you are computing the eigenvectors of the covariance matrix C = XᵀX / n of the centred data. Each eigenvector defines a principal component — a direction of maximum variance. The corresponding eigenvalue tells you how much variance is captured along that direction. Sorting by eigenvalue in descending order gives you the most informative directions first.
If the first two eigenvalues account for 95 percent of the total variance, projecting onto just those two eigenvectors retains 95 percent of the information in a fraction of the storage — that is the power of eigendecomposition.
```python
import numpy as np

# ── Verify the eigenvalue equation: Av = λv
A = np.array([[4, 1],
              [2, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)        # [5., 2.]
print("Eigenvectors (columns):\n", eigenvectors)

# Verify: Av ≈ λv for the first eigenpair
lam, v = eigenvalues[0], eigenvectors[:, 0]
print("Av == λv:", np.allclose(A @ v, lam * v))   # True

# ── PCA from scratch using eigendecomposition
np.random.seed(0)
X = np.random.randn(200, 5)               # 200 samples, 5 features
X_c = X - X.mean(axis=0)                  # Step 1: centre the data
C = X_c.T @ X_c / 200                     # Step 2: covariance matrix (5×5)
vals, vecs = np.linalg.eigh(C)            # Step 3: eigendecompose (use eigh for symmetric)

# Step 4: sort by eigenvalue descending
idx = np.argsort(vals)[::-1]
vals, vecs = vals[idx], vecs[:, idx]

# Step 5: explained variance
explained = vals / vals.sum()
print("Variance explained by each PC:", np.round(explained, 3))

# Step 6: project onto top 2 components
W = vecs[:, :2]                           # top-2 eigenvectors, shape (5, 2)
X_pca = X_c @ W                           # shape (200, 2) — 2D representation
print("PCA output shape:", X_pca.shape)   # (200, 2)
```
Singular Value Decomposition (SVD)
Singular Value Decomposition is one of the most powerful and general decompositions in linear algebra. While eigendecomposition requires a square matrix, SVD works on any matrix of any shape — making it universally applicable to real datasets. SVD decomposes any (m, n) matrix A into three matrices:
A = U Σ Vᵀ

where U is an (m, m) matrix whose columns are the left singular vectors, Σ is an (m, n) diagonal matrix holding the singular values σ₁ ≥ σ₂ ≥ … ≥ 0, and Vᵀ is an (n, n) matrix whose rows are the right singular vectors.
Truncated SVD: keep only the top-k singular values and vectors. The product Uₖ Σₖ Vₖᵀ gives the best rank-k approximation of A — the mathematical foundation of Latent Semantic Analysis and collaborative filtering.
```python
import numpy as np

# ── Full SVD decomposition
np.random.seed(7)
A = np.random.randn(4, 6)                 # a non-square (4, 6) matrix

U, sigma, Vt = np.linalg.svd(A, full_matrices=True)
print(f"U: {U.shape}, sigma: {sigma.shape}, Vt: {Vt.shape}")
# → U: (4, 4), sigma: (4,), Vt: (6, 6)
print("Singular values (descending):", np.round(sigma, 3))

# Verify reconstruction: A = U Σ Vᵀ
Sigma = np.zeros_like(A)                  # (4, 6) zero matrix
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)
A_reconstructed = U @ Sigma @ Vt
print("Reconstruction exact:", np.allclose(A, A_reconstructed))   # True

# ── Low-rank approximation (k=2 components)
k = 2
U_k = U[:, :k]                            # (4, 2)
S_k = np.diag(sigma[:k])                  # (2, 2)
Vt_k = Vt[:k, :]                          # (2, 6)
A_approx = U_k @ S_k @ Vt_k               # best rank-2 approximation of A

frob_error = np.linalg.norm(A - A_approx, 'fro')
var_captured = sigma[:k].sum() / sigma.sum()
print(f"Rank-{k} approximation captures {var_captured:.1%} of total singular value energy")

# ── Truncated SVD using scikit-learn (efficient for large sparse matrices)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2, random_state=42)
X_lsa = svd.fit_transform(A)              # e.g. Latent Semantic Analysis on a doc-term matrix
print("Explained variance ratio:", svd.explained_variance_ratio_)
```
Linear Algebra Inside a Neural Network Forward Pass
A neural network is, at its mathematical core, a sequence of matrix multiplications interleaved with non-linear activation functions. Understanding this makes backpropagation derivations, weight initialisation strategies, and efficiency considerations immediately intuitive.
For a batch of n samples passed through a single dense layer with d_in input features and d_out output neurons, the operation is one matrix multiplication plus broadcasting a bias vector.
```python
import numpy as np

# ── Network architecture: 32 samples, 10 features, 2 hidden layers
np.random.seed(0)

n, d_in, d_h1, d_h2, d_out = 32, 10, 64, 32, 1

# ── Inputs
X = np.random.randn(n, d_in)      # shape (32, 10)

# ── He initialisation (for ReLU) — scales weights by √(2/n_in)
W1 = np.random.randn(d_in, d_h1) * np.sqrt(2 / d_in)
b1 = np.zeros(d_h1)               # bias initialised to zero
W2 = np.random.randn(d_h1, d_h2) * np.sqrt(2 / d_h1)
b2 = np.zeros(d_h2)
W3 = np.random.randn(d_h2, d_out) * np.sqrt(2 / d_h2)
b3 = np.zeros(d_out)

# ── Forward pass: purely matrix multiplications + activation functions
Z1 = X @ W1 + b1                  # (32,10)@(10,64) = (32,64)
A1 = np.maximum(0, Z1)            # ReLU activation — element-wise

Z2 = A1 @ W2 + b2                 # (32,64)@(64,32) = (32,32)
A2 = np.maximum(0, Z2)            # ReLU activation

Z3 = A2 @ W3 + b3                 # (32,32)@(32,1) = (32,1)
y_hat = 1 / (1 + np.exp(-Z3))     # sigmoid — final probabilities

print("Output shape:", y_hat.shape)               # (32, 1)
print("Predictions (first 5):", y_hat[:5, 0].round(3))
```
Shape tracking is everything. The single most useful debugging skill in deep learning is tracking tensor shapes through every operation. When a matrix multiplication fails, it is always because the inner dimensions do not match. Writing the shape of every intermediate tensor as a comment — as done above — is standard professional practice and eliminates an entire class of bugs before they occur.
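A minimal illustration of the failure mode described above, with weights deliberately stored the wrong way round (the shapes are arbitrary examples):

```python
import numpy as np

X = np.random.randn(32, 10)       # batch of 32 samples, 10 features
W_bad = np.random.randn(64, 10)   # weights stored as (d_out, d_in) by mistake

try:
    X @ W_bad                     # inner dims 10 and 64 do not match
except ValueError as e:
    print("Shape error:", e)

H = X @ W_bad.T                   # a transpose fixes it: (32,10)@(10,64)
print(H.shape)                    # (32, 64)
```

Reading the error message as "inner dimensions disagree" and checking your shape comments is almost always faster than stepping through a debugger.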
Complete Linear Algebra Quick Reference for ML
The table below consolidates the most frequently used operations and properties. Bookmark this as a reference when working through the algorithms in the chapters that follow.
| Concept | NumPy Syntax | Where Used in ML |
|---|---|---|
| Dot product | a @ b or np.dot(a, b) | Linear regression predictions, attention scores, cosine similarity |
| Matrix multiply | A @ B | Dense layer forward pass, covariance matrix, Normal Equation |
| Transpose | A.T or A.transpose() | Backpropagation, Normal Equation, covariance computation |
| Matrix inverse | np.linalg.inv(A) | Normal Equation (prefer np.linalg.solve for stability) |
| L2 norm | np.linalg.norm(v) | Ridge regularisation, gradient clipping, cosine similarity |
| L1 norm | np.linalg.norm(v, 1) | Lasso regularisation, Manhattan distance in KNN |
| Eigendecomposition | np.linalg.eigh(A) | PCA, spectral clustering, kernel methods |
| SVD | np.linalg.svd(A) | PCA, truncated SVD, LSA, matrix factorisation for recommenders |
| Determinant | np.linalg.det(A) | Multivariate Gaussian density, invertibility check |
| Trace | np.trace(A) | Sum of eigenvalues, total variance, nuclear norm |
| Rank | np.linalg.matrix_rank(A) | Detecting multicollinearity, checking if system is solvable |
| Element-wise multiply | A * B (Hadamard product) | Dropout masks, attention masking, gradient application |
| Outer product | np.outer(a, b) | Gradient of linear layers, rank-1 updates |
| Broadcasting | A + b (auto-aligned) | Adding bias vectors to batch outputs in neural network layers |
Key Takeaways
- Every ML dataset is a matrix X of shape (n_samples, n_features). Every model prediction is a matrix-vector product.
- The dot product is the fundamental operation: it measures alignment between two vectors and powers linear regression, neural network layers, attention mechanisms, and similarity search.
- L1 and L2 norms appear directly as the regularisation terms in Lasso and Ridge regression, controlling model complexity by penalising large weights.
- Always use np.linalg.solve instead of np.linalg.inv for solving linear systems — it is faster and numerically more stable.
- Eigendecomposition of the covariance matrix is the mathematical foundation of PCA. Larger eigenvalues correspond to directions of greater variance.
- SVD works on any matrix of any shape and produces the best low-rank approximation — the foundation of Latent Semantic Analysis, collaborative filtering, and many compression techniques.
- A neural network forward pass is a sequence of matrix multiplications interleaved with non-linear activations. Shape tracking at every step is the primary debugging tool.
- Symmetric matrices always have real eigenvalues. Orthogonal matrices preserve lengths and angles. Positive definite matrices guarantee convex optimisation landscapes.
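The matrix facts in the last bullet can be checked numerically in a few lines (the matrices below are random examples, constructed to have each property):

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric matrices have real eigenvalues
S = rng.standard_normal((4, 4))
S = (S + S.T) / 2                                  # symmetrise
vals, _ = np.linalg.eig(S)                         # general solver, may return complex values
print("Eigenvalues are real:", np.all(np.isreal(vals)))    # True

# Orthogonal matrices preserve lengths: ‖Qv‖ = ‖v‖
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # Q from QR is orthogonal
v = rng.standard_normal(4)
print(np.linalg.norm(Q @ v), np.linalg.norm(v))    # identical up to rounding

# Positive definite matrices satisfy xᵀAx > 0 for all nonzero x
A = S @ S.T + 4 * np.eye(4)                        # PSD term plus 4I is positive definite
x = rng.standard_normal(4)
print("xᵀAx > 0:", x @ A @ x > 0)                  # True
```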
What’s Next?
In Chapter 2.2 — Calculus for Optimisation (Derivatives and Gradients), we build on linear algebra to understand how ML models actually learn. The gradient is a vector of partial derivatives — a linear algebra object — and gradient descent moves through the loss surface using these vectors. We will cover partial derivatives, the chain rule (which powers backpropagation), the Jacobian and Hessian matrices, and how all of these map directly to code using automatic differentiation in PyTorch.