Probability Theory and Distributions
Probability is the mathematical language of uncertainty — and Machine Learning is, at its core, a discipline of reasoning under uncertainty. Every prediction a model makes is a statement about probability. Understanding how randomness, likelihood, and distributions work is not optional background knowledge; it is the foundation upon which every algorithm from logistic regression to deep neural networks is built.
Why Does Machine Learning Need Probability Theory?
Real-world data is messy, incomplete, and inherently uncertain. A model rarely has access to perfect information. Probability theory gives us the tools to quantify uncertainty, make principled predictions, and understand how confident we should be in any given output. Here is where probability appears concretely in ML:
Sample Space, Events, and Probability
Every probability problem begins with a sample space — the set of all possible outcomes of a random experiment. From there, we define events as subsets of the sample space and assign probabilities to them.
The Three Axioms of Probability (Kolmogorov, 1933)
All of modern probability theory rests on three foundational axioms formalized by the Russian mathematician Andrei Kolmogorov. Every rule you use — from adding probabilities to Bayes’ theorem — is derived from just these three statements.
P(A) >= 0 for all events A
P(Ω) = 1
P(A ∪ B) = P(A) + P(B) for mutually exclusive events A and B (A ∩ B = ∅)
Key derivation: The general addition rule for non-mutually-exclusive events follows directly from Axiom 3:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
We subtract P(A ∩ B) to avoid double-counting the overlap region. Without this correction, elements in both A and B would be counted twice.
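The addition rule can be checked by brute-force counting on a small sample space. The sketch below assumes a fair six-sided die and two illustrative events chosen for this example (they are not from the text above); `prob` is a hypothetical helper.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (illustrative assumption).
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # event "roll is even"
B = {4, 5, 6}   # event "roll is greater than 3"

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

lhs = prob(A | B)                       # P(A ∪ B), counted directly
rhs = prob(A) + prob(B) - prob(A & B)   # P(A) + P(B) − P(A ∩ B)
print(lhs, rhs)  # → 2/3 2/3
```

Dropping the `- prob(A & B)` term would give 1 instead of 2/3, because outcomes 4 and 6 would be counted twice.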
Three Interpretations of Probability
There are three main philosophical interpretations of what a probability number actually means. Machine Learning draws from all three depending on context.
Frequentist
Interpretation: Probability is the long-run frequency of an event over many repeated trials.
ML use: Training accuracy, confidence intervals, hypothesis tests.
Bayesian
Interpretation: Probability represents a degree of belief, updated as new evidence arrives.
ML use: Bayesian neural networks, Gaussian processes, Naive Bayes.
Conditional Probability
Conditional probability answers the question: given that we know event B has already occurred, how does that change the probability of event A? This is arguably the single most important concept in probabilistic Machine Learning.
Conditional Probability — Visual Intuition
P(A|B) zooms in on circle B. Within that restricted universe, what fraction of area is also in A?
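In symbols, the definition is P(A|B) = P(A ∩ B) / P(B), valid whenever P(B) > 0. A minimal sketch of this "zooming in", assuming a fair die and two events picked purely for illustration:

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (illustrative assumption).
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # event "roll is even"
B = {4, 5, 6}   # event "roll is greater than 3"

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

# Restrict attention to B, then ask what fraction of B also lies in A.
p_a_given_b = prob(A & B) / prob(B)
print(prob(A))       # → 1/2  (before knowing B)
print(p_a_given_b)   # → 2/3  (after learning B occurred)
```

Learning that the roll exceeded 3 raises the probability of "even" from 1/2 to 2/3, because two of the three outcomes in B are even.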
Independence
Two events A and B are statistically independent if knowing that B occurred gives you no information about whether A occurred. Formally: P(A ∩ B) = P(A) · P(B), or equivalently P(A|B) = P(A) whenever P(B) > 0.
Critical ML connection: The Naive Bayes classifier is called “naive” precisely because it assumes all input features are conditionally independent given the class label — a simplification that rarely holds exactly in real data, but works surprisingly well in practice. Every time you compute a likelihood as a product of per-feature probabilities, you are invoking independence.
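Independence can be tested directly from its product definition, P(A ∩ B) = P(A) · P(B). The following sketch enumerates two fair dice; the events and the helper names (`prob`, `independent`) are assumptions made for this example only.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of two fair dice (illustrative assumption).
omega = list(product(range(1, 7), repeat=2))

def prob(pred):
    """Probability that a predicate holds over equally likely outcomes."""
    return Fraction(sum(1 for o in omega if pred(o)), len(omega))

def first_even(o):
    return o[0] % 2 == 0

def sum_is_7(o):
    return o[0] + o[1] == 7

def sum_ge_10(o):
    return o[0] + o[1] >= 10

def independent(e1, e2):
    """True when P(e1 and e2) equals P(e1) * P(e2) exactly."""
    joint = prob(lambda o: e1(o) and e2(o))
    return joint == prob(e1) * prob(e2)

print(independent(first_even, sum_is_7))    # → True
print(independent(first_even, sum_ge_10))   # → False
```

The second pair fails the test because a high sum makes a high (and here more often even) first die more likely, which is the kind of dependence between features that Naive Bayes chooses to ignore.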
Bayes’ Theorem — The Engine of Probabilistic Reasoning
Bayes’ theorem is the formal rule for updating a belief (probability) when new evidence arrives. It is derived entirely from the definition of conditional probability and is arguably the most important formula in all of Machine Learning and statistics.
Bayes’ Theorem component flow: prior P(θ) × likelihood P(X|θ) ÷ evidence P(X) = posterior P(θ|X), that is, P(θ|X) = P(X|θ) P(θ) / P(X).
The posterior combines our prior belief with the evidence from data. As more data arrives, the likelihood dominates the posterior and the prior becomes less influential.
A Concrete Medical Diagnosis Example
A disease affects 1% of the population. A test for it is 95% accurate (true positive rate) and has a 5% false positive rate. A patient tests positive. What is the actual probability they have the disease?
Counterintuitive result: Despite a 95% accurate test, a positive result only means a 16% chance of actually having the disease — because the disease is rare. This is why prevalence (the prior) matters enormously. Spam filters, fraud detectors, and medical AI systems all wrestle with exactly this problem: the rare-event trap.
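The 16% figure follows from plugging the three numbers in the problem statement into Bayes' theorem. A short sketch of the arithmetic (the variable names are chosen for this example):

```python
# Numbers taken from the example above.
prevalence = 0.01            # P(disease): the prior
sensitivity = 0.95           # P(positive | disease): true positive rate
false_positive_rate = 0.05   # P(positive | no disease)

# Evidence P(positive) via the law of total probability.
evidence = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' theorem: P(disease | positive).
posterior = sensitivity * prevalence / evidence
print(f"P(disease | positive) = {posterior:.3f}")  # → P(disease | positive) = 0.161
```

Of every 1,000 people tested, roughly 9.5 are true positives and 49.5 are false positives, so true positives make up only about 16% of all positive results.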
Random Variables — Formalizing Uncertainty
A random variable is a function that maps outcomes from a sample space to numerical values. Rather than talking about “the outcome of rolling a die,” we define X = “the number showing on top” and work with its numerical properties. Random variables are either discrete or continuous.
Discrete random variable: Takes on a countable number of distinct values. Each outcome has a specific, non-zero probability.
The sum of all PMF values = 1.
Continuous random variable: Takes on any value within an interval. The probability of any single exact value is zero; we measure probability over ranges.
The integral of the PDF over all x = 1.
PMF, PDF, and CDF — The Three Core Functions
| Function | Full Name | Applies To | Definition | Key Property |
|---|---|---|---|---|
| PMF | Probability Mass Function | Discrete | P(X = x) — exact probability at point x | ∑ P(X=x) = 1 over all x |
| PDF | Probability Density Function | Continuous | f(x) such that P(a ≤ X ≤ b) = ∫ f(x)dx from a to b | ∫ f(x)dx = 1 from −∞ to +∞ |
| CDF | Cumulative Distribution Function | Both | F(x) = P(X ≤ x) — probability up to x | Monotone non-decreasing; F(−∞)=0, F(+∞)=1 |
Key insight for ML practitioners: For a continuous distribution, P(X = 5.0000) = 0 — the probability of any exact single value is zero. This is why we always speak of probabilities over intervals (e.g., “probability that height is between 170 and 180 cm”) and why the PDF itself can have values greater than 1, as long as the integral over all space equals 1.
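A density above 1 is easy to produce: squeeze all the probability mass into an interval shorter than 1. The sketch below uses `scipy.stats.uniform` on [0, 0.5], an interval chosen only for illustration.

```python
from scipy import stats

# Uniform distribution on [0, 0.5]: scipy parameterizes it as [loc, loc + scale].
narrow = stats.uniform(loc=0.0, scale=0.5)

density_at_point = narrow.pdf(0.25)                  # height of the PDF, not a probability
prob_interval = narrow.cdf(0.3) - narrow.cdf(0.2)    # P(0.2 <= X <= 0.3)
total_mass = narrow.cdf(0.5) - narrow.cdf(0.0)       # probability over the whole support

print(f"density={density_at_point:.1f}, "
      f"P(0.2<=X<=0.3)={prob_interval:.1f}, total={total_mass:.1f}")
# → density=2.0, P(0.2<=X<=0.3)=0.2, total=1.0
```

The density is 2 everywhere on the support, yet every interval probability stays at or below 1 because the density is multiplied by the interval width.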
Expected Value, Variance, and Standard Deviation
Once we have a distribution, we want to summarize it with a few key numbers. The expected value tells us the center; the variance tells us the spread. These are the building blocks of every loss function and evaluation metric in ML.
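For a discrete random variable the standard definitions are E[X] = ∑ x · P(X = x) and Var(X) = E[(X − E[X])²], with the standard deviation being the square root of the variance. A minimal sketch, assuming a fair six-sided die as the example distribution:

```python
import numpy as np

# Fair six-sided die (illustrative assumption): values 1..6, each with probability 1/6.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

mean = np.sum(values * probs)                     # E[X] = sum of x * P(X = x)
variance = np.sum((values - mean) ** 2 * probs)   # Var(X) = E[(X - E[X])^2]
std = np.sqrt(variance)                           # standard deviation

print(f"mean={mean:.3f}, var={variance:.3f}, std={std:.3f}")
# → mean=3.500, var=2.917, std=1.708
```

The variance is in squared units of X, which is why the standard deviation is usually the number reported alongside the mean.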
Essential Discrete Distributions
These three discrete distributions are the workhorses of classification, counting problems, and binary outcomes in Machine Learning. You will encounter them repeatedly across algorithms, loss functions, and probabilistic models.
Bernoulli
PMF: P(X=1) = p, P(X=0) = 1−p
Mean: E[X] = p
Variance: Var(X) = p(1−p)
ML uses: Binary classification output layer (sigmoid), binary cross-entropy loss, logistic regression.
Binomial
PMF: P(X=k) = C(n,k) pᵏ (1−p)ⁿ⁻ᵏ
Mean: E[X] = np
Variance: Var(X) = np(1−p)
ML uses: Modelling success counts in batch experiments, A/B testing, bootstrapping (sampling with replacement).
Poisson
PMF: P(X=k) = e^(−λ) λᵏ / k!
Mean: E[X] = λ
Variance: Var(X) = λ
ML uses: Count regression, modelling rare events, NLP word frequency models, neural network activation counts.
```python
from scipy import stats
import numpy as np

# ── BERNOULLI DISTRIBUTION ─────────────────────────────────
p = 0.7
bern = stats.bernoulli(p)
print(f"Bernoulli(p=0.7): Mean={bern.mean():.2f}, Var={bern.var():.2f}")
print(f"P(X=1)={bern.pmf(1):.1f}, P(X=0)={bern.pmf(0):.1f}")
# → Mean=0.70, Var=0.21 | P(X=1)=0.7, P(X=0)=0.3

# ── BINOMIAL DISTRIBUTION ──────────────────────────────────
n, p = 10, 0.3
binom = stats.binom(n, p)
print(f"Binomial(10, 0.3): Mean={binom.mean():.1f}, Var={binom.var():.1f}")
print(f"P(X=3 successes)={binom.pmf(3):.4f}")
# → Mean=3.0, Var=2.1 | P(X=3)=0.2668

# ── POISSON DISTRIBUTION ───────────────────────────────────
lam = 4  # average 4 emails per hour
pois = stats.poisson(lam)
print(f"Poisson(λ=4): Mean={pois.mean():.1f}, Var={pois.var():.1f}")
print(f"P(exactly 6 emails)={pois.pmf(6):.4f}")
print(f"P(at most 5 emails) = {pois.cdf(5):.4f}")
# → Mean=4.0, Var=4.0 | P(X=6)=0.1042 | CDF(5)=0.7851
```
Essential Continuous Distributions
Continuous distributions describe variables that can take any real value within a range. These appear everywhere in ML — from modeling noise in regression to describing the weights inside a neural network.
Uniform
PDF: f(x) = 1 / (b−a) for a ≤ x ≤ b
Mean: E[X] = (a+b) / 2
Variance: Var(X) = (b−a)² / 12
ML uses: Random weight initialization, random search over hyperparameters, random number generation.
Normal (Gaussian)
PDF: f(x) = 1/(σ√(2π)) · exp(−(x−μ)² / (2σ²))
Mean: E[X] = μ
Variance: Var(X) = σ²
ML uses: Noise modeling, weight initialization (Xavier/He), Gaussian processes, variational autoencoders.
Exponential
PDF: f(x) = λ e^(−λx) for x ≥ 0
Mean: E[X] = 1/λ
Variance: Var(X) = 1/λ²
ML uses: Time-to-failure modeling, survival analysis, network latency models.
Beta
PDF: f(x) = x^(α−1) (1−x)^(β−1) / B(α,β) for 0 ≤ x ≤ 1
Mean: E[X] = α / (α+β)
ML uses: Bayesian inference (conjugate prior for Bernoulli), modeling click-through rates, variational inference.
Gamma
PDF: f(x) = x^(k−1) e^(−x/θ) / (θᵏ Γ(k)) for x > 0
Mean: E[X] = kθ
ML uses: Conjugate prior for Poisson rate λ, Bayesian neural networks, topic modeling priors.
Student’s t
Mean: E[X] = 0 (for ν > 1)
Variance: Var(X) = ν/(ν−2) (for ν > 2)
ML uses: Small-sample statistics, robust regression, t-SNE dimensionality reduction, Bayesian modeling with outlier robustness.
The Normal Distribution — A Deep Dive
The Normal (Gaussian) distribution deserves special attention because it appears everywhere in ML — from the Central Limit Theorem to weight initialization to noise modeling. Its bell-shaped curve is defined entirely by two parameters: the mean μ (location) and the standard deviation σ (spread).
The Standard Normal and Z-Score
Any Normal distribution can be converted to the Standard Normal N(0,1) by computing the Z-score. This is the basis of feature standardization (one of the most common preprocessing steps in ML):
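The Z-score of a value x is z = (x − μ) / σ: subtract the mean, then divide by the standard deviation. A minimal sketch of feature standardization with NumPy, using a small made-up array of heights (the data is hypothetical):

```python
import numpy as np

# Hypothetical feature column: heights in centimetres.
heights_cm = np.array([160.0, 170.0, 180.0, 190.0])

mu = heights_cm.mean()      # sample mean (175.0 here)
sigma = heights_cm.std()    # population standard deviation (ddof=0)

# Z-score: how many standard deviations each value lies from the mean.
z = (heights_cm - mu) / sigma
print(np.round(z, 3))       # roughly -1.342, -0.447, 0.447, 1.342

# After standardization the feature has mean ~0 and standard deviation ~1.
print(np.isclose(z.mean(), 0.0), np.isclose(z.std(), 1.0))  # → True True
```

In a real pipeline μ and σ would be estimated on the training split only and then reused to transform validation and test data.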
Covariance and Correlation
When working with multiple features, we need to understand how they relate to each other. Covariance and correlation measure the linear relationship between two random variables. These concepts underpin PCA, feature selection, and multicollinearity detection.
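The standard definitions are Cov(X, Y) = E[(X − μₓ)(Y − μᵧ)] and ρ = Cov(X, Y) / (σₓ σᵧ), so correlation is covariance rescaled to lie in [−1, 1]. A small sketch on made-up data, computing both by hand and cross-checking against NumPy:

```python
import numpy as np

# Made-up data: y is an exact linear function of x, so correlation should be 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# Population covariance: average product of deviations from the means.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson correlation: covariance divided by both standard deviations.
corr_xy = cov_xy / (x.std() * y.std())

print(f"cov={cov_xy:.1f}, corr={corr_xy:.1f}")  # → cov=4.0, corr=1.0

# Cross-check with NumPy's built-ins (ddof=0 matches the population formula).
print(np.isclose(cov_xy, np.cov(x, y, ddof=0)[0, 1]))   # → True
print(np.isclose(corr_xy, np.corrcoef(x, y)[0, 1]))     # → True
```

Covariance depends on the units of both variables (here it is 4.0), while correlation is unit-free, which is why correlation is the usual choice for comparing feature relationships.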
The Central Limit Theorem (CLT)
The Central Limit Theorem is one of the most profound results in all of mathematics. It explains why the Normal distribution appears everywhere in nature and in Machine Learning, even when the underlying data is not Gaussian.
“Given a sufficiently large sample from any population with a finite mean and variance, the distribution of the sample mean will be approximately Normal, regardless of the underlying population distribution.”
— Central Limit Theorem
```python
import numpy as np

np.random.seed(42)

# The underlying population: Exponential (highly skewed, NOT Gaussian)
population = np.random.exponential(scale=2.0, size=100_000)
print(f"Population mean: {population.mean():.2f}")  # ~2.0

# Draw 10,000 samples of size n, compute each sample mean
for n in [2, 10, 30, 100]:
    sample_means = [
        np.mean(np.random.choice(population, n))
        for _ in range(10_000)
    ]
    sm = np.array(sample_means)
    print(f"n={n:3d}: mean={sm.mean():.3f}, std={sm.std():.3f}, (theory σ/√n={2.0/np.sqrt(n):.3f})")
# Approximate output (exact digits depend on the random draws):
# → n=  2: mean=1.998, std=1.412, (theory σ/√n=1.414)
# → n= 10: mean=2.003, std=0.633, (theory σ/√n=0.632)
# → n= 30: mean=1.998, std=0.364, (theory σ/√n=0.365)
# → n=100: mean=2.001, std=0.200, (theory σ/√n=0.200)
# The spread of the sample means tracks σ/√n, and their distribution
# moves closer to Normal as n grows, even for a skewed Exponential population.
```
Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation is the most widely used method for fitting a probability distribution to data. The core idea: given observed data, choose the distribution parameters that make the observed data most probable. MLE is the theoretical foundation behind training logistic regression, neural networks (with cross-entropy loss), and most other supervised ML algorithms.
MLE core idea (flow): data X → likelihood L(θ|X) → log-likelihood ℓ(θ) → solve dℓ/dθ = 0 → estimate θ̂
Minimizing cross-entropy loss is Maximum Likelihood Estimation. When you train a classifier with cross-entropy loss, you are performing MLE of the parameters θ (weights) under a categorical distribution (Bernoulli in the binary case): minimizing cross-entropy is equivalent to maximizing the log-likelihood. This is why MLE is not just academic theory; it is the objective behind most neural network classifiers you train.
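The equivalence can be verified numerically: binary cross-entropy computed from its usual formula matches the negative mean Bernoulli log-likelihood. The labels and predicted probabilities below are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical binary labels and a model's predicted P(Y=1) for each example.
y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Binary cross-entropy, written out from its textbook formula.
bce = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Negative mean log-likelihood of the labels under Bernoulli(p_pred).
nll = -np.mean(stats.bernoulli.logpmf(y_true, p_pred))

print(np.isclose(bce, nll))  # → True
```

Because the two quantities are identical for any predictions, the parameters that minimize one also maximize the likelihood.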
```python
import numpy as np
from scipy import stats

# ── MLE FOR GAUSSIAN ───────────────────────────────────────
np.random.seed(0)
true_mu, true_sigma = 5.0, 2.0
X = np.random.normal(true_mu, true_sigma, size=1000)

# MLE estimates = sample mean and std
mu_mle = X.mean()
sigma_mle = X.std()  # ddof=0 is the MLE (biased); ddof=1 gives the unbiased variance estimate
print(f"True μ={true_mu}, MLE μ̂={mu_mle:.3f}")
print(f"True σ={true_sigma}, MLE σ̂={sigma_mle:.3f}")
# → both estimates land close to the true values (μ=5.0, σ=2.0), up to sampling error

# ── BAYES THEOREM: COIN BIAS INFERENCE ─────────────────────
# Question: is this coin fair? We observe 7 heads in 10 flips
n_flips, n_heads = 10, 7

# Grid of possible bias values p ∈ [0, 1]
p_grid = np.linspace(0, 1, 1000)

# Prior: flat Uniform prior (no prior knowledge)
prior = np.ones(len(p_grid))

# Likelihood: Binomial, P(7 heads | p)
likelihood = stats.binom.pmf(n_heads, n_flips, p_grid)

# Posterior (unnormalized) = likelihood * prior
posterior = likelihood * prior
posterior_norm = posterior / posterior.sum()  # normalize over the grid

# MAP estimate = p with highest posterior probability
p_map = p_grid[np.argmax(posterior_norm)]
print(f"MAP estimate of coin bias p = {p_map:.3f}")
# → MAP estimate of coin bias p = 0.700 (= MLE: 7/10, because the prior is flat)
```
Complete Distributions Reference for Machine Learning
The following table serves as a quick-reference guide mapping each distribution to its parameters, summary statistics, and primary ML applications. Bookmark this table.
| Distribution | Type | Parameters | Mean | Variance | Key ML Use |
|---|---|---|---|---|---|
| Bernoulli | Discrete | p ∈ [0,1] | p | p(1−p) | Binary classification, logistic regression |
| Binomial | Discrete | n, p | np | np(1−p) | A/B testing, bootstrapping, count outcomes |
| Multinomial | Discrete | n, p₁…p_K | npᵢ | npᵢ(1−pᵢ) | Multi-class classification, NLP word counts |
| Poisson | Discrete | λ > 0 | λ | λ | Count regression, rare event modeling |
| Geometric | Discrete | p ∈ (0,1] | 1/p | (1−p)/p² | Modeling number of trials until first success |
| Uniform | Continuous | a, b | (a+b)/2 | (b−a)²/12 | Random initialization, random search |
| Normal (Gaussian) | Continuous | μ, σ² | μ | σ² | Noise model, weight init, VAE latent space |
| Exponential | Continuous | λ > 0 | 1/λ | 1/λ² | Time-to-event, survival analysis |
| Beta | Continuous | α, β > 0 | α/(α+β) | αβ/((α+β)²(α+β+1)) | Bayesian prior for probabilities, click rates |
| Gamma | Continuous | k, θ | kθ | kθ² | Prior for rates, positive-valued outcomes |
| Dirichlet | Continuous | α₁…α_K | αᵢ/∑αᵢ | — | Prior over categorical distributions, LDA topic models |
| Student’s t | Continuous | ν (degrees of freedom) | 0 (ν>1) | ν/(ν−2) (ν>2) | Small-sample inference, t-SNE, robust models |
| Chi-Squared | Continuous | k (degrees) | k | 2k | Goodness-of-fit tests, feature selection (χ² test) |
| Multivariate Gaussian | Multivariate | μ vector, Σ matrix | μ | Σ | GMMs, Gaussian processes, LDA, Kalman filters |
Working with Distributions in Python
The scipy.stats module provides a consistent API for all common distributions. Every distribution object supports the same four core methods: pdf (or pmf for discrete distributions), cdf, ppf (the inverse CDF), and rvs (random sampling).
```python
import numpy as np
from scipy import stats

# ── NORMAL DISTRIBUTION ────────────────────────────────────
norm = stats.norm(loc=0, scale=1)  # Standard Normal N(0,1)
print(f"P(Z < 0) = {norm.cdf(0):.4f}")                        # → 0.5000
print(f"P(-1 < Z < 1) = {norm.cdf(1) - norm.cdf(-1):.4f}")    # → 0.6827
print(f"P(-2 < Z < 2) = {norm.cdf(2) - norm.cdf(-2):.4f}")    # → 0.9545
print(f"95th percentile = {norm.ppf(0.95):.4f}")              # → 1.6449
samples = norm.rvs(size=1000)  # draw 1000 samples

# ── BETA DISTRIBUTION ──────────────────────────────────────
# Use case: modeling click-through rate (CTR) for a web button
# Prior belief: 10 clicks out of 100 impressions observed
alpha, beta_param = 10, 90  # successes=10, failures=90
beta_dist = stats.beta(alpha, beta_param)
print(f"Beta({alpha},{beta_param}): Mean={beta_dist.mean():.3f}")
print(f"95% CI for CTR: [{beta_dist.ppf(0.025):.3f}, {beta_dist.ppf(0.975):.3f}]")
# → Mean=0.1 | 95% CI: [0.049, 0.169]

# ── FITTING A DISTRIBUTION TO DATA (MLE via scipy) ─────────
data = np.random.exponential(scale=3.0, size=500)
# scipy's fit() performs MLE automatically
loc_fit, scale_fit = stats.expon.fit(data, floc=0)
print(f"MLE scale (1/λ) = {scale_fit:.3f}")  # → ~3.0

# ── MULTIVARIATE NORMAL ────────────────────────────────────
from scipy.stats import multivariate_normal
mu = [0, 0]
Sigma = [[1, 0.8], [0.8, 1]]  # covariance matrix (correlated features)
mvn = multivariate_normal(mean=mu, cov=Sigma)
x2d = mvn.rvs(size=1000)  # 2D samples; used in GMM, PCA
print(f"Sample correlation: {np.corrcoef(x2d[:,0], x2d[:,1])[0,1]:.3f}")  # → ~0.800
```
How These Distributions Show Up Inside ML Algorithms
The following table maps specific ML algorithms and components to the probability distributions they explicitly use or implicitly assume. This is the bridge between probability theory and practical ML.
| ML Algorithm / Component | Distribution Used | How It Is Used |
|---|---|---|
| Logistic Regression | Bernoulli | Output is P(Y=1\|X). Loss = negative log-likelihood of Bernoulli = binary cross-entropy |
| Softmax / Multi-class | Multinomial | Output layer gives P(Y=k\|X) for each class. Loss = categorical cross-entropy |
| Naive Bayes Classifier | Gaussian / Multinomial | Assumes each feature follows a Gaussian (Gaussian NB) or Multinomial (text NB) distribution |
| Linear Regression (OLS) | Normal | Assumes residuals εᵢ ~ N(0, σ²). MSE minimization is MLE under this assumption |
| Ridge Regression (L2) | Normal (Prior) | Equivalent to MAP estimation with a zero-mean Gaussian prior on the weights; the prior variance is inversely proportional to the penalty strength λ |
| Lasso Regression (L1) | Laplace (Prior) | Equivalent to placing a Laplace prior on weights = MAP estimation with sparsity |
| Gaussian Mixture Models | Multivariate Normal | Data is modeled as mixture of K Gaussian components; E-M algorithm estimates parameters |
| Weight Initialization (He/Xavier) | Normal / Uniform | He: weights drawn from N(0, 2/n_in). Xavier (Glorot): Uniform[−√(6/(n_in+n_out)), +√(6/(n_in+n_out))]. Both are scaled to maintain gradient flow |
| Variational Autoencoder (VAE) | Normal | Latent space z ~ N(0,I). The ELBO loss contains KL divergence between posterior and prior |
| Dropout Regularization | Bernoulli | Each neuron is dropped with Bernoulli(p) independently during training |
| t-SNE | Normal + Student-t | High-dim similarities use Gaussian kernel; low-dim similarities use heavier-tailed t(1) |
| LDA Topic Modeling | Dirichlet + Multinomial | Per-document topic distributions θ_d ~ Dirichlet(α); per-topic word distributions ϕ_k ~ Dirichlet(β) |
Conjugate Priors — Closed-Form Bayesian Updates
A conjugate prior is a prior distribution such that the posterior (after observing data) has the same parametric form as the prior — only with updated parameters. This gives us closed-form Bayesian updates without integration, which is computationally invaluable.
Beta prior + Bernoulli/Binomial likelihood
Posterior: Beta(α + successes, β + failures)
Start with Beta(1,1) = Uniform prior. After observing 7 heads and 3 tails, posterior = Beta(8, 4). The Beta parameters simply count the outcomes.
Gamma prior + Poisson likelihood
Posterior: Gamma(α + ∑xᵢ, β + n)
Used in modeling event rates λ. After observing n intervals with ∑xᵢ total events, update both shape and rate parameters.
Normal prior + Normal likelihood (known variance)
Posterior: Normal with updated mean and variance
Used in Kalman filters, Gaussian processes, and Bayesian linear regression. The posterior mean is a precision-weighted average of prior and data.
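The Beta update for coin flips is short enough to write out in full. This sketch reuses the numbers from the coin example (a Beta(1,1) prior, then 7 heads and 3 tails); the variable names are chosen for illustration.

```python
from scipy import stats

# Prior: Beta(1, 1), i.e. Uniform on [0, 1].
alpha_prior, beta_prior = 1, 1

# Observed data: 7 heads (successes) and 3 tails (failures).
heads, tails = 7, 3

# Conjugate update: add the counts to the prior parameters. No integration needed.
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails
posterior = stats.beta(alpha_post, beta_post)

# Posterior mean and mode (the mode is the MAP estimate when both parameters exceed 1).
post_mean = posterior.mean()
post_mode = (alpha_post - 1) / (alpha_post + beta_post - 2)

print(f"Posterior: Beta({alpha_post}, {beta_post})")        # → Posterior: Beta(8, 4)
print(f"mean={post_mean:.3f}, mode={post_mode:.3f}")        # → mean=0.667, mode=0.700
```

With a flat prior the posterior mode equals the MLE of 7/10, while the posterior mean of 8/12 is pulled slightly toward 1/2 by the prior's pseudo-counts.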
Complete Practical Example — Gaussian Naive Bayes Classifier from Scratch
Let us tie everything together: conditional probability, Bayes’ theorem, and the Gaussian distribution all combine inside the Gaussian Naive Bayes classifier. Here is the full probabilistic reasoning chain with annotated code.
Gaussian Naive Bayes probabilistic chain: prior P(class) × likelihood P(X|class) = N(μ,σ²) ÷ evidence P(X) → posterior → predicted class
```python
import numpy as np
from scipy import stats

class GaussianNaiveBayesScratch:
    """Gaussian NB: Each feature|class ~ Normal(μ, σ²)"""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = {}  # P(class), from data frequency
        self.means = {}   # μ per feature per class, MLE estimate
        self.stds = {}    # σ per feature per class, MLE estimate
        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / len(X)    # P(class)
            self.means[c] = X_c.mean(axis=0)      # MLE μ per feature
            # MLE σ per feature; the tiny epsilon guards against zero variance
            self.stds[c] = X_c.std(axis=0) + 1e-9
        return self

    def predict(self, X):
        preds = []
        for x in X:
            log_posteriors = {}
            for c in self.classes:
                # log P(class): log prior
                log_prior = np.log(self.priors[c])
                # sum of log N(x_i | μ_ic, σ_ic): log likelihood
                log_like = np.sum(
                    stats.norm(self.means[c], self.stds[c]).logpdf(x)
                )
                # Bayes' theorem, unnormalized: P(X) is the same for every class
                log_posteriors[c] = log_prior + log_like
            preds.append(max(log_posteriors, key=log_posteriors.get))
        return np.array(preds)

# ── TEST ON IRIS DATASET ────────────────────────────────────
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gnb = GaussianNaiveBayesScratch()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Accuracy from scratch: {accuracy_score(y_test, y_pred):.2%}")
# → Accuracy from scratch: 96.67%
```
What just happened? We built a complete classifier from scratch using only Bayes’ theorem and the Gaussian PDF. The fit() method performs MLE to estimate μ and σ for each feature conditioned on each class. The predict() method applies Bayes’ theorem in log-space (for numerical stability) to find the class with the highest posterior. Result: 96.67% accuracy — from probability theory alone.
Key Takeaways
- Probability theory is the mathematical language of Machine Learning. Every prediction, every loss function, and every training algorithm is rooted in probability.
- The three Kolmogorov axioms — non-negativity, normalization, and additivity — are the foundation from which all probability rules are derived.
- Conditional probability P(A|B) restricts the sample space to B. Bayes’ theorem flips this: it lets us compute P(θ|data) from P(data|θ), which is how probabilistic models are trained.
- Discrete distributions (Bernoulli, Binomial, Poisson) model count and binary outcomes. Continuous distributions (Normal, Beta, Exponential) model measurements and densities.
- The Normal distribution is special: it appears as noise in regression, as weight initializations in neural networks, as the target in VAE latent spaces, and is justified universally by the Central Limit Theorem.
- MLE (Maximum Likelihood Estimation) is the principled method for fitting distributions to data. Minimizing cross-entropy loss is exactly MLE for categorical distributions — which is what every classifier trained with gradient descent does.
- Covariance and correlation measure linear relationships between features and are central to PCA, feature selection, and detecting multicollinearity.
- The Central Limit Theorem says that sample means are approximately Normal for large samples from any distribution with finite mean and variance, which justifies confidence intervals, t-tests, and the noise properties of mini-batch gradient descent.
What’s Next?
In Chapter 2.4 — Descriptive and Inferential Statistics, we will build on this foundation by studying how to summarize datasets with measures of central tendency and spread, how to draw inferences from samples using hypothesis testing and confidence intervals, and how statistical significance connects to model evaluation in Machine Learning.