Linear Algebra for Machine Learning
Linear algebra is the language of machine learning. Every dataset is a matrix. Every prediction is a dot product. Every dimensionality reduction, neural network weight update, and similarity computation is, at its core, a linear algebra operation. Understanding these foundations is not optional — it is what separates a practitioner who runs algorithms from one who truly understands what is happening inside them.
Why Linear Algebra is Central to ML
When you call model.fit(X, y), dozens of linear algebra operations execute behind the scenes. When a neural network performs a forward pass, it is computing matrix multiplications. When PCA reduces 500 features to 10, it is performing an eigendecomposition. When you compute cosine similarity between two document embeddings, you are computing a normalised dot product.
The reason GPUs accelerate ML so dramatically is that they are architecturally optimised for one specific class of computation: batched matrix operations. Every hardware and software breakthrough in deep learning — from CUDA to TensorFlow to PyTorch — is fundamentally an innovation in making linear algebra faster.
This chapter builds your intuition and computational fluency with the six core linear algebra topics that appear repeatedly throughout the ML curriculum: scalars, vectors, matrices, and tensors; vector operations; matrix operations; special matrices; eigenvalues and eigenvectors; and Singular Value Decomposition (SVD).
Scalars, Vectors, Matrices, and Tensors
Before any operation can be performed, we must understand the fundamental data structures that linear algebra works with. These four types form a natural hierarchy, each being a generalisation of the previous.
```python
import numpy as np

# ── Scalar: a single Python or NumPy number
alpha = 0.01                          # learning rate — just a float
print("Scalar:", alpha, "| ndim:", np.array(alpha).ndim)

# ── Vector: a 1-D array — one data point or one feature column
v = np.array([2, 7, 5, 1])            # shape (4,)
print("Vector shape:", v.shape)       # (4,)

# ── Matrix: a 2-D array — the entire dataset X
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])             # shape (3, 3)
print("Matrix shape:", A.shape)       # (3, 3)

# ── Tensor: an array with 3 or more dimensions — here a 4-D batch of images
batch = np.zeros((32, 224, 224, 3))   # 32 RGB images, 224x224
print("Tensor shape:", batch.shape)   # (32, 224, 224, 3)
print("Dimensions:", batch.ndim)      # 4

# ── The ML dataset layout: X is always shape (n_samples, n_features)
X = np.random.randn(1000, 20)         # 1000 samples, 20 features
print(f"Dataset X: {X.shape} — rows=samples, cols=features")
```
Vector Operations
Vectors are the atomic unit of ML data — a single data point, a single weight vector, or a single embedding. The operations defined on vectors directly correspond to fundamental ML computations.
Vector Addition and Scalar Multiplication
Vector addition combines two vectors element-wise. Scalar multiplication scales every element uniformly. Together they define how gradient descent updates weights: the new weight vector is the old weight vector w minus the learning rate α times the gradient vector g.
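As a minimal numerical sketch of that update rule (the quadratic loss, starting point, and learning rate below are illustrative choices, not from any specific model):

```python
import numpy as np

# Hypothetical quadratic loss L(w) = ||w - w_star||², whose gradient is 2*(w - w_star)
w_star = np.array([1.0, -2.0, 0.5])   # illustrative optimum
w = np.zeros(3)                        # initial weight vector
alpha = 0.1                            # learning rate α

for _ in range(100):
    g = 2 * (w - w_star)               # gradient vector g
    w = w - alpha * g                  # the update: scalar multiplication + vector subtraction

print(np.round(w, 4))                  # converges towards w_star
```

Every iteration is nothing more than the two vector operations defined above, which is why gradient descent scales to millions of parameters.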
The Dot Product
The dot product is the single most important operation in machine learning. It multiplies corresponding elements of two vectors and sums the results, producing a single scalar. Geometrically, it measures how much two vectors point in the same direction. When the dot product is zero, the vectors are orthogonal (perpendicular). When it is large and positive, they are well-aligned.
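A quick numerical check of this geometric reading (the vectors here are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 3.0])      # perpendicular to a
c = np.array([2.0, 0.1])      # nearly parallel to a

print(a @ b)                  # 0.0 — orthogonal vectors
print(a @ c)                  # 2.0 — large and positive: well-aligned

# The identity a·c = ‖a‖‖c‖cos(θ) connects the algebra to the geometry
theta = np.arccos((a @ c) / (np.linalg.norm(a) * np.linalg.norm(c)))
print(np.degrees(theta))      # a small angle, since a and c are nearly parallel
```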
Vector Norms (Measuring Magnitude)
A norm is a function that assigns a non-negative scalar length to a vector. Different norms measure “size” differently and appear in different ML contexts. The most common are the L1 norm (sum of absolute values) and the L2 norm (Euclidean length).
| Norm | Formula | Geometry | ML Application |
|---|---|---|---|
| L0 "Norm" (not a true norm) | ‖v‖₀ = number of nonzero elements | Sparsity count | Sparsity constraints, compressed sensing |
| L1 Norm (Manhattan) | ‖v‖₁ = ∑ᵢ\|vᵢ\| | City-block distance | Lasso regression regularisation — promotes sparse solutions |
| L2 Norm (Euclidean) | ‖v‖₂ = √(∑ᵢ vᵢ²) | Straight-line distance | Ridge regression, KNN distance, gradient magnitude |
| L∞ Norm (Max) | ‖v‖∞ = maxᵢ \|vᵢ\| | Maximum absolute value | Adversarial ML, gradient clipping |
```python
import numpy as np

a = np.array([2, 3, -1])
b = np.array([4, 1, 2])

# ── Vector addition and scalar multiplication
print("a + b:", a + b)                # [6, 4, 1]
print("2 * a:", 2 * a)                # [4, 6, -2]
print("a - 0.01*b:", a - 0.01 * b)    # gradient descent step

# ── Dot product — three equivalent forms
print("Dot product (np.dot):", np.dot(a, b))     # 9
print("Dot product (@):", a @ b)                 # 9 — preferred syntax
print("Dot product (manual):", np.sum(a * b))    # 9

# ── Vector norms
print("L1 norm (Lasso):", np.linalg.norm(a, ord=1))    # 6.0
print("L2 norm (Ridge):", np.linalg.norm(a, ord=2))    # 3.7417
print("L-inf norm:", np.linalg.norm(a, ord=np.inf))    # 3.0

# ── Cosine similarity — critical for embeddings and NLP
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cos_sim:.4f}")                              # 0.5249
print(f"Angle between vectors: {np.degrees(np.arccos(cos_sim)):.1f}°")  # 58.3°
```
Matrix Operations
Matrices allow us to represent and operate on entire datasets simultaneously. The operations defined on matrices are the building blocks of every ML algorithm.
Matrix Multiplication
Matrix multiplication is the fundamental operation of deep learning. Given a matrix A of shape (m, k) and a matrix B of shape (k, n), their product AB has shape (m, n). Each entry in the result is the dot product of a row from A with a column from B. The inner dimensions must match — this constraint arises directly from the dot product requirement.
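To make the row-by-column structure explicit, here is a naive sketch that computes each entry of AB as a dot product and checks it against NumPy's optimised `@` operator (the matrices are arbitrary examples):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])             # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])              # shape (3, 2)

m, k = A.shape
k2, n = B.shape
assert k == k2                        # inner dimensions must match

C = np.zeros((m, n))
for i in range(m):                    # each row of A...
    for j in range(n):                # ...meets each column of B
        C[i, j] = A[i, :] @ B[:, j]   # entry (i, j) is a single dot product

print(C)                              # [[ 58.  64.] [139. 154.]]
print(np.allclose(C, A @ B))          # True — matches the built-in matmul
```

Never use loops like this in production — `A @ B` dispatches to highly optimised BLAS routines — but writing it once cements where the shape constraint comes from.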
Transpose
The transpose of a matrix A, written Aᵀ, flips the matrix over its main diagonal: element (i, j) becomes element (j, i). A matrix of shape (m, n) becomes shape (n, m). Transpose is used constantly in ML — in the Normal Equation for linear regression, in computing covariance matrices for PCA, and in the backpropagation algorithm.
Matrix Inverse
The inverse of a square matrix A, written A⁻¹, is the matrix such that AA⁻¹ = I (the identity matrix). Not all matrices have inverses — a matrix must be square and non-singular (determinant not equal to zero) to be invertible. The inverse appears directly in the closed-form solution for linear regression.
Practical warning: computing the matrix inverse directly is numerically unstable and computationally expensive for large matrices — O(n³) time complexity. In practice, ML frameworks use numerical solvers such as LU decomposition or QR decomposition to solve the equivalent linear system Ax = b without explicitly forming A⁻¹. In NumPy, always prefer np.linalg.solve(A, b) over np.linalg.inv(A) @ b.
```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])             # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])              # shape (3, 2)

# ── Matrix multiplication: (2,3) @ (3,2) → (2,2)
C = A @ B
print("A @ B =\n", C)                 # [[58, 64], [139, 154]]

# ── Transpose
print("A.T shape:", A.T.shape)        # (3, 2)

# ── Normal Equation for linear regression: w = (XᵀX)⁻¹ Xᵀy
np.random.seed(42)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), np.random.randn(n, p)])   # bias column
y = X @ np.array([1, 2, -1, 3]) + 0.1 * np.random.randn(n)

# Method 1: Direct inverse (less stable)
w_direct = np.linalg.inv(X.T @ X) @ X.T @ y

# Method 2: np.linalg.solve (numerically preferred)
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Method 3: NumPy's built-in least squares (most robust)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("Recovered weights (true: [1, 2, -1, 3]):")
print(np.round(w_lstsq, 3))           # close to [1, 2, -1, 3] up to noise
```
Special Matrices You Must Know
Certain matrix types have special properties that are exploited throughout ML theory and algorithms. Recognising them makes derivations far more readable and computation far more efficient.
```python
import numpy as np
from scipy.sparse import csr_matrix

# ── Identity matrix
I = np.eye(3)
print("Identity:\n", I)

# ── Symmetric check: covariance matrix is always symmetric
X = np.random.randn(50, 4)
C = X.T @ X / 50                      # covariance matrix, shape (4, 4)
print("Is symmetric:", np.allclose(C, C.T))                            # True

# ── Orthogonal check: QR decomposition gives an orthogonal Q
Q, R = np.linalg.qr(np.random.randn(4, 4))
print("Q is orthogonal (QᵀQ = I):", np.allclose(Q.T @ Q, np.eye(4)))   # True

# ── Positive definite: covariance matrix has all positive eigenvalues
eigenvalues, _ = np.linalg.eigh(C)    # eigh for symmetric matrices
print("All eigenvalues positive:", np.all(eigenvalues > 0))            # True

# ── Sparse matrix for memory efficiency
dense = np.eye(10000)                 # 10000x10000 identity ≈ 800 MB dense
sparse = csr_matrix(dense)            # sparse: stores only the 10000 nonzeros
print(f"Sparse stored: {sparse.nnz} elements vs {dense.size} in dense")
```
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors reveal the intrinsic structure of a linear transformation. Given a square matrix A, a non-zero vector v is an eigenvector of A if multiplying A by v only stretches or compresses the vector — it does not rotate it. The amount of stretching is given by the corresponding eigenvalue λ.
PCA is eigendecomposition of the covariance matrix. When you run PCA, you are computing the eigenvectors of the covariance matrix C = XᵀX / n of the centred data. Each eigenvector defines a principal component — a direction of maximum variance. The corresponding eigenvalue tells you how much variance is captured along that direction. Sorting by eigenvalue in descending order gives you the most informative directions first.
If the first two eigenvalues account for 95 percent of the total variance, projecting onto just those two eigenvectors retains 95 percent of the information in a fraction of the storage — that is the power of eigendecomposition.
```python
import numpy as np

# ── Verify the eigenvalue equation: Av = λv
A = np.array([[4, 1],
              [2, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)        # [5., 2.]
print("Eigenvectors (columns):\n", eigenvectors)

# Verify: Av ≈ λv for the first eigenpair
lam, v = eigenvalues[0], eigenvectors[:, 0]
print("Av == λv:", np.allclose(A @ v, lam * v))   # True

# ── PCA from scratch using eigendecomposition
np.random.seed(0)
X = np.random.randn(200, 5)               # 200 samples, 5 features
X_c = X - X.mean(axis=0)                  # Step 1: centre the data
C = X_c.T @ X_c / 200                     # Step 2: covariance matrix (5×5)
vals, vecs = np.linalg.eigh(C)            # Step 3: eigendecompose (use eigh for symmetric)

# Step 4: sort by eigenvalue descending
idx = np.argsort(vals)[::-1]
vals, vecs = vals[idx], vecs[:, idx]

# Step 5: explained variance
explained = vals / vals.sum()
print("Variance explained by each PC:", np.round(explained, 3))

# Step 6: project onto top 2 components
W = vecs[:, :2]                           # top-2 eigenvectors, shape (5, 2)
X_pca = X_c @ W                           # shape (200, 2) — 2D representation
print("PCA output shape:", X_pca.shape)   # (200, 2)
```
Singular Value Decomposition (SVD)
Singular Value Decomposition is one of the most powerful and general decompositions in linear algebra. While eigendecomposition requires a square matrix, SVD works on any matrix of any shape — making it universally applicable to real datasets. SVD decomposes any (m, n) matrix A into three matrices:
A = U Σ Vᵀ

where U is an (m, m) matrix whose columns are the left singular vectors, Σ is an (m, n) diagonal matrix holding the singular values σ₁ ≥ σ₂ ≥ … ≥ 0, and Vᵀ is an (n, n) matrix whose rows are the right singular vectors.
Truncated SVD: keep only the top-k singular values and vectors. The product Uₖ Σₖ Vₖᵀ gives the best rank-k approximation of A — the mathematical foundation of Latent Semantic Analysis and collaborative filtering.
```python
import numpy as np

# ── Full SVD decomposition
np.random.seed(7)
A = np.random.randn(4, 6)                 # a non-square (4, 6) matrix

U, sigma, Vt = np.linalg.svd(A, full_matrices=True)
print(f"U: {U.shape}, sigma: {sigma.shape}, Vt: {Vt.shape}")
# → U: (4, 4), sigma: (4,), Vt: (6, 6)
print("Singular values (descending):", np.round(sigma, 3))

# Verify reconstruction: A = U Σ Vᵀ
Sigma = np.zeros_like(A)                  # (4, 6) zero matrix
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)
A_reconstructed = U @ Sigma @ Vt
print("Reconstruction exact:", np.allclose(A, A_reconstructed))   # True

# ── Low-rank approximation (k=2 components)
k = 2
U_k = U[:, :k]                            # (4, 2)
S_k = np.diag(sigma[:k])                  # (2, 2)
Vt_k = Vt[:k, :]                          # (2, 6)
A_approx = U_k @ S_k @ Vt_k               # best rank-2 approximation of A

frob_error = np.linalg.norm(A - A_approx, 'fro')
var_captured = sigma[:k].sum() / sigma.sum()
print(f"Rank-{k} approximation captures {var_captured:.1%} of total singular value energy")

# ── Truncated SVD using scikit-learn (efficient for large sparse matrices)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2, random_state=42)
X_lsa = svd.fit_transform(A)              # e.g. Latent Semantic Analysis on a doc-term matrix
print("Explained variance ratio:", svd.explained_variance_ratio_)
```
Linear Algebra Inside a Neural Network Forward Pass
A neural network is, at its mathematical core, a sequence of matrix multiplications interleaved with non-linear activation functions. Understanding this makes backpropagation derivations, weight initialisation strategies, and efficiency considerations immediately intuitive.
For a batch of n samples passed through a single dense layer with d_in input features and d_out output neurons, the operation is one matrix multiplication plus broadcasting a bias vector.
```python
import numpy as np

# ── Network architecture: 32 samples, 10 features, 2 hidden layers
np.random.seed(0)

n, d_in, d_h1, d_h2, d_out = 32, 10, 64, 32, 1

# ── Inputs
X = np.random.randn(n, d_in)      # shape (32, 10)

# ── He initialisation (for ReLU) — scales weights by √(2/n_in)
W1 = np.random.randn(d_in, d_h1) * np.sqrt(2 / d_in)
b1 = np.zeros(d_h1)               # bias initialised to zero
W2 = np.random.randn(d_h1, d_h2) * np.sqrt(2 / d_h1)
b2 = np.zeros(d_h2)
W3 = np.random.randn(d_h2, d_out) * np.sqrt(2 / d_h2)
b3 = np.zeros(d_out)

# ── Forward pass: purely matrix multiplications + activation functions
Z1 = X @ W1 + b1                  # (32,10)@(10,64) = (32,64)
A1 = np.maximum(0, Z1)            # ReLU activation — element-wise

Z2 = A1 @ W2 + b2                 # (32,64)@(64,32) = (32,32)
A2 = np.maximum(0, Z2)            # ReLU activation

Z3 = A2 @ W3 + b3                 # (32,32)@(32,1) = (32,1)
y_hat = 1 / (1 + np.exp(-Z3))     # sigmoid — final probabilities

print("Output shape:", y_hat.shape)               # (32, 1)
print("Predictions (first 5):", y_hat[:5, 0].round(3))
```
Shape tracking is everything. The single most useful debugging skill in deep learning is tracking tensor shapes through every operation. When a matrix multiplication fails, it is always because the inner dimensions do not match. Writing the shape of every intermediate tensor as a comment — as done above — is standard professional practice and eliminates an entire class of bugs before they occur.
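A minimal illustration of the failure mode described above, with weights deliberately stored the wrong way round (the shapes are arbitrary examples):

```python
import numpy as np

X = np.random.randn(32, 10)       # batch of 32 samples, 10 features
W_bad = np.random.randn(64, 10)   # weights stored as (d_out, d_in) by mistake

try:
    X @ W_bad                     # inner dims 10 and 64 do not match
except ValueError as e:
    print("Shape error:", e)

H = X @ W_bad.T                   # a transpose fixes it: (32,10)@(10,64)
print(H.shape)                    # (32, 64)
```

Reading the error message as "inner dimensions disagree" and checking your shape comments is almost always faster than stepping through a debugger.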
Complete Linear Algebra Quick Reference for ML
The table below consolidates the most frequently used operations and properties. Bookmark this as a reference when working through the algorithms in the chapters that follow.
| Concept | NumPy Syntax | Where Used in ML |
|---|---|---|
| Dot product | a @ b or np.dot(a, b) | Linear regression predictions, attention scores, cosine similarity |
| Matrix multiply | A @ B | Dense layer forward pass, covariance matrix, Normal Equation |
| Transpose | A.T or A.transpose() | Backpropagation, Normal Equation, covariance computation |
| Matrix inverse | np.linalg.inv(A) | Normal Equation (prefer np.linalg.solve for stability) |
| L2 norm | np.linalg.norm(v) | Ridge regularisation, gradient clipping, cosine similarity |
| L1 norm | np.linalg.norm(v, 1) | Lasso regularisation, Manhattan distance in KNN |
| Eigendecomposition | np.linalg.eigh(A) | PCA, spectral clustering, kernel methods |
| SVD | np.linalg.svd(A) | PCA, truncated SVD, LSA, matrix factorisation for recommenders |
| Determinant | np.linalg.det(A) | Multivariate Gaussian density, invertibility check |
| Trace | np.trace(A) | Sum of eigenvalues, total variance, nuclear norm |
| Rank | np.linalg.matrix_rank(A) | Detecting multicollinearity, checking if system is solvable |
| Element-wise multiply | A * B (Hadamard product) | Dropout masks, attention masking, gradient application |
| Outer product | np.outer(a, b) | Gradient of linear layers, rank-1 updates |
| Broadcasting | A + b (auto-aligned) | Adding bias vectors to batch outputs in neural network layers |
Key Takeaways
- Every ML dataset is a matrix X of shape (n_samples, n_features). Every model prediction is a matrix-vector product.
- The dot product is the fundamental operation: it measures alignment between two vectors and powers linear regression, neural network layers, attention mechanisms, and similarity search.
- L1 and L2 norms appear directly as the regularisation terms in Lasso and Ridge regression, controlling model complexity by penalising large weights.
- Always use np.linalg.solve instead of np.linalg.inv for solving linear systems — it is faster and numerically more stable.
- Eigendecomposition of the covariance matrix is the mathematical foundation of PCA. Larger eigenvalues correspond to directions of greater variance.
- SVD works on any matrix of any shape and produces the best low-rank approximation — the foundation of Latent Semantic Analysis, collaborative filtering, and many compression techniques.
- A neural network forward pass is a sequence of matrix multiplications interleaved with non-linear activations. Shape tracking at every step is the primary debugging tool.
- Symmetric matrices always have real eigenvalues. Orthogonal matrices preserve lengths and angles. Positive definite matrices guarantee convex optimisation landscapes.
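The matrix facts in the last bullet can be checked numerically in a few lines (the matrices below are random examples, constructed to have each property):

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric matrices have real eigenvalues
S = rng.standard_normal((4, 4))
S = (S + S.T) / 2                                  # symmetrise
vals, _ = np.linalg.eig(S)                         # general solver, may return complex values
print("Eigenvalues are real:", np.all(np.isreal(vals)))    # True

# Orthogonal matrices preserve lengths: ‖Qv‖ = ‖v‖
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # Q from QR is orthogonal
v = rng.standard_normal(4)
print(np.linalg.norm(Q @ v), np.linalg.norm(v))    # identical up to rounding

# Positive definite matrices satisfy xᵀAx > 0 for all nonzero x
A = S @ S.T + 4 * np.eye(4)                        # PSD term plus 4I is positive definite
x = rng.standard_normal(4)
print("xᵀAx > 0:", x @ A @ x > 0)                  # True
```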
What’s Next?
In Chapter 2.2 — Calculus for Optimisation (Derivatives and Gradients), we build on linear algebra to understand how ML models actually learn. The gradient is a vector of partial derivatives — a linear algebra object — and gradient descent moves through the loss surface using these vectors. We will cover partial derivatives, the chain rule (which powers backpropagation), the Jacobian and Hessian matrices, and how all of these map directly to code using automatic differentiation in PyTorch.