Probability Theory and Distributions

Probability is the mathematical language of uncertainty — and Machine Learning is, at its core, a discipline of reasoning under uncertainty. Every prediction a model makes is a statement about probability. Understanding how randomness, likelihood, and distributions work is not optional background knowledge; it is the foundation upon which every algorithm from logistic regression to deep neural networks is built.

Why Does Machine Learning Need Probability Theory?

Real-world data is messy, incomplete, and inherently uncertain. A model rarely has access to perfect information. Probability theory gives us the tools to quantify uncertainty, make principled predictions, and understand how confident we should be in any given output. Here is where probability appears concretely in ML:

Model Outputs
Classifiers output class probabilities, not just labels. A spam filter says “87% likely spam.”
Loss Functions
Cross-entropy loss is derived directly from the log-likelihood of a probability distribution.
Bayesian Learning
Bayesian methods treat model parameters as probability distributions, not fixed values.
Generative Models
VAEs, GANs, and diffusion models are built entirely on probability distributions over data.
Regularization
L2 regularization is equivalent to placing a Gaussian prior on model weights (MAP estimation).

Sample Space, Events, and Probability

Every probability problem begins with a sample space — the set of all possible outcomes of a random experiment. From there, we define events as subsets of the sample space and assign probabilities to them.

Sample Space (Ω)
The complete set of all possible outcomes. For a single coin flip: Ω = {Heads, Tails}. For a die roll: Ω = {1, 2, 3, 4, 5, 6}.
Event (A, B, …)
A subset of the sample space. “Rolling an even number” is the event A = {2, 4, 6} from Ω = {1, 2, 3, 4, 5, 6}.
Probability P(A)
A number between 0 and 1 that measures how likely event A is to occur. P(even die) = 3/6 = 0.5.

The Three Axioms of Probability (Kolmogorov, 1933)

All of modern probability theory rests on three foundational axioms formalized by the Russian mathematician Andrei Kolmogorov. Every rule you use — from adding probabilities to Bayes’ theorem — is derived from just these three statements.

Axiom 1
Non-Negativity
P(A) >= 0 for all events A
The probability of any event is always a non-negative real number. Probabilities can never be negative. A probability of 0 means “impossible”; a probability of 1 means “certain.”
Axiom 2
Normalization
P(Ω) = 1
The probability of the entire sample space — i.e., the probability that some outcome occurs — is exactly 1. All probabilities must sum to 1 across all possible outcomes.
Axiom 3
Countable Additivity
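P(A₁ ∪ A₂ ∪ …) = P(A₁) + P(A₂) + … for pairwise mutually exclusive events
For mutually exclusive events, the probability that any one of them occurs equals the sum of their individual probabilities. For two such events A and B (where A ∩ B = ∅), this reads P(A ∪ B) = P(A) + P(B). If events can overlap, the intersection must be subtracted once.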

Key derivation: The general addition rule for non-mutually-exclusive events follows directly from Axiom 3:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
We subtract P(A ∩ B) to avoid double-counting the overlap region. Without this correction, outcomes in both A and B would be counted twice.
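The addition rule can be checked by brute-force counting on a small sample space. The sketch below assumes a fair six-sided die with equally likely outcomes; the events A ("even") and B ("greater than 3") and the helper `P` are illustrative choices, not part of the text above.

```python
from fractions import Fraction

omega = set(range(1, 7))  # sample space of a fair six-sided die (assumed)

def P(event):
    """Probability by counting, valid only for equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}  # "roll is even"
B = {4, 5, 6}  # "roll is greater than 3"

union_direct = P(A | B)                     # count the union directly
union_by_rule = P(A) + P(B) - P(A & B)      # general addition rule
naive_sum = P(A) + P(B)                     # forgets the overlap {4, 6}

print(union_direct, union_by_rule)  # 2/3 2/3
print(naive_sum)                    # 1 (overcounts)
```

Using `Fraction` keeps the arithmetic exact, so the two sides can be compared with `==` rather than a floating-point tolerance.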


Three Interpretations of Probability

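Three philosophical interpretations of a probability number are commonly distinguished: classical (equally likely outcomes), frequentist (long-run frequency), and Bayesian (degree of belief). The two that Machine Learning relies on most are described below, and ML draws on each depending on context.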

Classical (Frequentist)

Interpretation: Probability is the long-run frequency of an event over many repeated trials.

Flip a fair coin 10,000 times
Heads appears ~5,000 times
P(Heads) = 0.5 as n → ∞

ML use: Training accuracy, confidence intervals, hypothesis tests.

Bayesian (Subjective)

Interpretation: Probability represents a degree of belief, updated as new evidence arrives.

Start with a prior belief: P(biased coin) = 0.5
Observe data (3 heads in a row)
Update belief: P(biased coin | data) ≈ 0.89, assuming the biased coin always lands heads

ML use: Bayesian neural networks, Gaussian processes, Naive Bayes.
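The coin update above can be reproduced in a few lines. This is a sketch under one explicit assumption that the text does not state: the "biased" coin always lands heads. Under that assumption, a prior of 0.5 and three heads in a row give a posterior of about 0.89. All variable names are illustrative.

```python
prior_biased = 0.5      # prior belief that the coin is biased
p_heads_biased = 1.0    # ASSUMPTION: a biased coin always lands heads
p_heads_fair = 0.5
n_heads = 3             # observed: three heads in a row

# Likelihood of the data under each hypothesis
like_biased = p_heads_biased ** n_heads
like_fair = p_heads_fair ** n_heads

# Bayes' rule: posterior = likelihood * prior / evidence
evidence = like_biased * prior_biased + like_fair * (1 - prior_biased)
posterior_biased = like_biased * prior_biased / evidence

print(round(posterior_biased, 2))  # 0.89
```

A different assumption about how biased the coin is (say, 80% heads) would give a different posterior, which is why the likelihood model has to be stated alongside the prior.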


Conditional Probability

Conditional probability answers the question: given that we know event B has already occurred, how does that change the probability of event A? This is arguably the single most important concept in probabilistic Machine Learning.

Conditional Probability Formula
P(A | B) = P(A ∩ B) / P(B), where P(B) > 0
Read as: “Probability of A given B.”
Conditioning restricts the sample space from Ω to only the outcomes in B, then asks: what fraction of those outcomes also lie in A?

Conditional Probability — Visual Intuition

[Figure: Venn diagram of events A and B inside Ω, with regions labelled "A only", "A∩B", and "B only". P(A|B): restrict to B, find A∩B.]

P(A|B) zooms in on circle B. Within that restricted universe, what fraction of area is also in A?
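The "zoom in on B" picture translates directly into counting. The following sketch assumes a fair die with equally likely outcomes; the events and the helper names `prob` and `cond_prob` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

omega = set(range(1, 7))  # fair six-sided die (assumed)

def prob(event):
    """P(event) by counting equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(a, b):
    """P(a | b) = P(a ∩ b) / P(b); undefined when P(b) = 0."""
    if prob(b) == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return prob(a & b) / prob(b)

A = {2, 4, 6}  # "roll is even"
B = {4, 5, 6}  # "roll is greater than 3"

print(prob(A))          # 1/2  (before knowing anything)
print(cond_prob(A, B))  # 2/3  (after learning the roll is > 3)
```

Learning that B occurred raises the probability of A from 1/2 to 2/3, because two of the three outcomes in B are even.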

Independence

Two events A and B are statistically independent if knowing that B occurred gives you no information about whether A occurred. Formally:

Statistical Independence
P(A ∩ B) = P(A) × P(B) if and only if A and B are independent
Equivalently (when P(A) > 0 and P(B) > 0): P(A | B) = P(A) and P(B | A) = P(B)
Knowing B provides zero information about A

Critical ML connection: The Naive Bayes classifier is called “naive” precisely because it assumes all input features are conditionally independent given the class label — a simplification that rarely holds exactly in real data, but works surprisingly well in practice. Every time you compute a likelihood as a product of per-feature probabilities, you are invoking independence.
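The product test for independence is easy to run on a small example. The sketch below again assumes a fair die; the three events and the helper `is_independent` are illustrative and not from the text.

```python
from fractions import Fraction

omega = set(range(1, 7))  # fair six-sided die (assumed)

def prob(event):
    return Fraction(len(event & omega), len(omega))

def is_independent(a, b):
    """Events are independent iff P(a ∩ b) equals P(a) * P(b)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
greater_than_three = {4, 5, 6}

print(is_independent(even, at_most_four))        # True:  1/3 == 1/2 * 2/3
print(is_independent(even, greater_than_three))  # False: 1/3 != 1/2 * 1/2
```

Independence is a property of the probabilities, not of how the events are described: "even" and "at most four" look related, yet knowing one tells you nothing about the other.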


Bayes’ Theorem — The Engine of Probabilistic Reasoning

Bayes’ theorem is the formal rule for updating a belief (probability) when new evidence arrives. It is derived entirely from the definition of conditional probability and is arguably the most important formula in all of Machine Learning and statistics.

Bayes’ Theorem
P(θ | X) = [ P(X | θ) × P(θ) ] / P(X)
where:
θ = hypothesis (e.g. model parameters, or class label)
X = observed data / evidence
P(θ|X) = Posterior: updated belief after seeing data
P(X|θ) = Likelihood: probability of data given hypothesis
P(θ) = Prior: initial belief before seeing data
P(X) = Evidence: normalizing constant (marginal likelihood)

Bayes’ Theorem — Component Flow

Prior P(θ) × Likelihood P(X|θ) / Evidence P(X) = Posterior P(θ|X)

The posterior combines our prior belief with the evidence from data. As more data arrives, the likelihood dominates the posterior and the prior becomes less influential.

A Concrete Medical Diagnosis Example

A disease affects 1% of the population. A test for it is 95% accurate (true positive rate) and has a 5% false positive rate. A patient tests positive. What is the actual probability they have the disease?

Bayes’ Theorem — Medical Diagnosis
P(Disease) = 0.01 (prior: 1% prevalence)
P(+Test | Disease) = 0.95 (likelihood: sensitivity)
P(+Test | No Disease) = 0.05 (false positive rate)

P(+Test) = P(+ | Disease) × P(Disease) + P(+ | No Disease) × P(No Disease)
         = 0.95 × 0.01 + 0.05 × 0.99
         = 0.0095 + 0.0495
         = 0.059

P(Disease | +Test) = (0.95 × 0.01) / 0.059
                   = 0.0095 / 0.059
                   ≈ 0.161 (only ~16%!)

Counterintuitive result: Despite a 95% accurate test, a positive result only means a 16% chance of actually having the disease — because the disease is rare. This is why prevalence (the prior) matters enormously. Spam filters, fraud detectors, and medical AI systems all wrestle with exactly this problem: the rare-event trap.
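To see how strongly the prior drives this result, the calculation can be wrapped in a small function and rerun with a different prevalence. This is a minimal sketch; the function name `posterior_given_positive` and the second scenario (10% prevalence) are illustrative additions, not part of the original example.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # Evidence: total probability of a positive test (law of total probability)
    evidence = (sensitivity * prevalence
                + false_positive_rate * (1 - prevalence))
    return sensitivity * prevalence / evidence

# The example from the text: 1% prevalence
rare = posterior_given_positive(0.01, 0.95, 0.05)
print(f"{rare:.3f}")    # 0.161

# Hypothetical variation: same test, 10% prevalence
common = posterior_given_positive(0.10, 0.95, 0.05)
print(f"{common:.3f}")  # 0.679
```

The test itself is unchanged between the two calls; only the prior moves, and the posterior jumps from roughly 16% to roughly 68%.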


Random Variables — Formalizing Uncertainty

A random variable is a function that maps outcomes from a sample space to numerical values. Rather than talking about “the outcome of rolling a die,” we define X = “the number showing on top” and work with its numerical properties. Random variables are either discrete or continuous.

Discrete Random Variable

Takes on a countable number of distinct values. Each outcome has a specific, non-zero probability.

X = number on a die: {1, 2, 3, 4, 5, 6}
Y = number of defects per batch
Z = count of spam emails per day
Described by a PMF — P(X = x)

The sum of all PMF values = 1.

Continuous Random Variable

Takes on any value within an interval. The probability of any single exact value is zero; we measure probability over ranges.

X = height of a person (cm)
Y = temperature in Bangalore at noon
Z = time until a server responds (ms)
Described by a PDF — f(x)

The integral of the PDF over all x = 1.

PMF, PDF, and CDF — The Three Core Functions

Function | Full Name | Applies To | Definition | Key Property
PMF | Probability Mass Function | Discrete | P(X = x), the exact probability at point x | ∑ P(X=x) = 1 over all x
PDF | Probability Density Function | Continuous | f(x) such that P(a ≤ X ≤ b) = ∫ f(x) dx taken from a to b | ∫ f(x) dx = 1 from −∞ to +∞
CDF | Cumulative Distribution Function | Both | F(x) = P(X ≤ x), the probability up to x | Monotone non-decreasing; F(−∞)=0, F(+∞)=1

Key insight for ML practitioners: For a continuous distribution, P(X = 5.0000) = 0 — the probability of any exact single value is zero. This is why we always speak of probabilities over intervals (e.g., “probability that height is between 170 and 180 cm”) and why the PDF itself can have values greater than 1, as long as the integral over all space equals 1.
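A narrow Gaussian makes the "density can exceed 1" point concrete. The sketch below uses `scipy.stats.norm` with an arbitrarily chosen standard deviation of 0.1; the variable names are illustrative.

```python
from scipy import stats

# A Normal with a small spread: most of the mass sits in a short interval
narrow = stats.norm(loc=0.0, scale=0.1)

density_at_mean = narrow.pdf(0.0)                        # a density, not a probability
prob_within_1sd = narrow.cdf(0.1) - narrow.cdf(-0.1)     # a genuine probability

print(f"{density_at_mean:.3f}")   # 3.989  (greater than 1, and that is fine)
print(f"{prob_within_1sd:.4f}")   # 0.6827 (probabilities always stay in [0, 1])
```

The density is large because the total area of 1 is squeezed into a narrow region; probabilities only appear once the density is integrated over an interval, which is what the CDF difference computes.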


Expected Value, Variance, and Standard Deviation

Once we have a distribution, we want to summarize it with a few key numbers. The expected value tells us the center; the variance tells us the spread. These are the building blocks of every loss function and evaluation metric in ML.

Expected Value (Mean)
E[X] = μ
Weighted average of all outcomes, weighted by their probabilities.
Variance
Var(X) = σ²
Average squared deviation from the mean. Measures spread/dispersion.
Standard Deviation
σ = √Var(X)
Same units as X. Easier to interpret than variance.
Computational Formula
Var = E[X²] − μ²
Often more efficient than computing deviations directly.
Expectation and Variance — Discrete and Continuous
=== DISCRETE CASE ===
E[X] = Σₓ x · P(X = x)    (sum over all values x)
Var(X) = Σₓ (x − μ)² · P(X = x)

=== CONTINUOUS CASE ===
E[X] = ∫ x · f(x) dx    (integral over all x)
Var(X) = ∫ (x − μ)² · f(x) dx

=== LINEARITY OF EXPECTATION (always holds) ===
E[aX + b] = a·E[X] + b
E[X + Y] = E[X] + E[Y]    (even for dependent variables!)

=== VARIANCE UNDER SCALING AND SHIFTING ===
Var(aX + b) = a² · Var(X)
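These formulas can be verified exactly on a small discrete distribution. The sketch below assumes a fair six-sided die and a hypothetical transformation Y = 2X + 3; exact fractions are used so the identities can be checked with `==`.

```python
from fractions import Fraction

# PMF of a fair six-sided die (assumed): each face has probability 1/6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g):
    """E[g(X)] = sum over x of g(x) * P(X = x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x)
var_definition = expectation(lambda x: (x - mean) ** 2)
var_shortcut = expectation(lambda x: x ** 2) - mean ** 2   # E[X²] − μ²

print(mean, var_definition, var_shortcut)   # 7/2 35/12 35/12

# Linear transformation Y = aX + b with illustrative a = 2, b = 3
a, b = 2, 3
mean_y = expectation(lambda x: a * x + b)
var_y = expectation(lambda x: (a * x + b - mean_y) ** 2)

print(mean_y == a * mean + b)           # True: E[aX + b] = a·E[X] + b
print(var_y == a ** 2 * var_definition) # True: Var(aX + b) = a²·Var(X)
```

Note that the shift b changes the mean but drops out of the variance entirely.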

Essential Discrete Distributions

These three discrete distributions are the workhorses of classification, counting problems, and binary outcomes in Machine Learning. You will encounter them repeatedly across algorithms, loss functions, and probabilistic models.

Bernoulli Distribution
X ~ Bernoulli(p)
Description: Models a single binary trial — success (1) with probability p, failure (0) with probability 1−p.

PMF: P(X=1) = p, P(X=0) = 1−p
Mean: E[X] = p
Variance: Var(X) = p(1−p)

ML uses: Binary classification output layer (sigmoid), binary cross-entropy loss, logistic regression.
Binomial Distribution
X ~ Binomial(n, p)
Description: Counts the number of successes in n independent Bernoulli trials, each with probability p.

PMF: P(X=k) = C(n,k) pᵏ (1−p)ⁿ⁻ᵏ
Mean: E[X] = np
Variance: Var(X) = np(1−p)

ML uses: Modelling success counts in batch experiments, A/B testing, bootstrapping (sampling with replacement).
Poisson Distribution
X ~ Poisson(λ)
Description: Models the number of events occurring in a fixed interval when events happen at a constant average rate λ and independently of each other.

PMF: P(X=k) = e^(−λ) λ^k / k!
Mean: E[X] = λ
Variance: Var(X) = λ

ML uses: Count regression, modelling rare events, NLP word frequency models, neural network activation counts.
Python — Discrete Distributions with scipy.stats
from scipy import stats

# ── BERNOULLI DISTRIBUTION ─────────────────────────────────
p = 0.7
bern = stats.bernoulli(p)
print(f"Bernoulli(p=0.7): Mean={bern.mean():.2f}, Var={bern.var():.2f}")
print(f"P(X=1)={bern.pmf(1):.2f}, P(X=0)={bern.pmf(0):.2f}")
# → Mean=0.70, Var=0.21 | P(X=1)=0.70, P(X=0)=0.30

# ── BINOMIAL DISTRIBUTION ──────────────────────────────────
n, p = 10, 0.3
binom = stats.binom(n, p)
print(f"Binomial(10, 0.3): Mean={binom.mean():.1f}, Var={binom.var():.1f}")
print(f"P(X=3 successes)={binom.pmf(3):.4f}")
# → Mean=3.0, Var=2.1 | P(X=3)=0.2668

# ── POISSON DISTRIBUTION ───────────────────────────────────
lam = 4  # average 4 emails per hour
pois = stats.poisson(lam)
print(f"Poisson(λ=4): Mean={pois.mean():.1f}, Var={pois.var():.1f}")
print(f"P(exactly 6 emails)={pois.pmf(6):.4f}")
print(f"P(at most 5 emails) = {pois.cdf(5):.4f}")
# → Mean=4.0, Var=4.0 | P(X=6)=0.1042 | CDF(5)=0.7851

Essential Continuous Distributions

Continuous distributions describe variables that can take any real value within a range. These appear everywhere in ML — from modeling noise in regression to describing the weights inside a neural network.

Uniform Distribution
X ~ Uniform(a, b)
All values in [a, b] are equally likely. The simplest non-trivial continuous distribution.

PDF: f(x) = 1 / (b−a)
Mean: E[X] = (a+b) / 2
Variance: (b−a)² / 12

ML uses: Random weight initialization, random search over hyperparameters, random number generation.
Normal (Gaussian) Distribution
X ~ N(μ, σ²)
The most important distribution in statistics and ML. Bell-shaped, symmetric around mean μ with spread σ.

PDF: f(x) = 1/(σ√(2π)) · exp(−(x−μ)² / (2σ²))
Mean: E[X] = μ
Variance: Var(X) = σ²

ML uses: Noise modeling, weight initialization (Xavier/He), Gaussian processes, variational autoencoders.
Exponential Distribution
X ~ Exp(λ)
Models the time between events in a Poisson process. Memoryless — the probability of an event in the next interval does not depend on how long you have already waited.

PDF: f(x) = λ e^(−λx) for x ≥ 0
Mean: E[X] = 1/λ
Variance: 1/λ²

ML uses: Time-to-failure modeling, survival analysis, network latency models.
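The memoryless property can be checked numerically with the survival function P(X > t). The sketch below uses `scipy.stats.expon`, which is parameterized by `scale = 1/λ`; the rate and the waiting times are arbitrary illustrative values.

```python
from scipy import stats

rate = 0.5                                # λ: events per unit time (illustrative)
wait = stats.expon(scale=1 / rate)        # scipy uses scale = 1/λ

already_waited, extra = 3.0, 2.0

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
p_conditional = wait.sf(already_waited + extra) / wait.sf(already_waited)

# P(X > t) for someone who has just started waiting
p_fresh = wait.sf(extra)

print(f"{p_conditional:.4f} {p_fresh:.4f}")   # 0.3679 0.3679
```

Both probabilities equal e^(−λt) = e^(−1): having already waited 3 units does not change the chance of waiting at least 2 more.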
Beta Distribution
X ~ Beta(α, β)
Defined on [0,1]. Flexible shape controlled by α and β. Perfect for modeling probabilities themselves as random variables.

PDF: f(x) = x^(α−1) (1−x)^(β−1) / B(α,β)
Mean: E[X] = α / (α+β)

ML uses: Bayesian inference (conjugate prior for Bernoulli), modeling click-through rates, variational inference.
Gamma Distribution
X ~ Gamma(k, θ)
Generalization of the exponential. Models the time to the k-th event in a Poisson process. Always positive-valued.

PDF: f(x) = x^(k−1) e^(−x/θ) / (θ^k Γ(k))
Mean: E[X] = kθ

ML uses: Conjugate prior for Poisson rate λ, Bayesian neural networks, topic modeling priors.
Student’s t-Distribution
X ~ t(ν)
Similar to Normal but with heavier tails, controlled by degrees of freedom ν. As ν→∞, converges to Normal(0,1).

Mean: E[X] = 0 (for ν > 1)
Variance: ν/(ν−2) (for ν > 2)

ML uses: Small-sample statistics, robust regression, t-SNE dimensionality reduction, Bayesian modeling with outlier robustness.

The Normal Distribution — A Deep Dive

The Normal (Gaussian) distribution deserves special attention because it appears everywhere in ML — from the Central Limit Theorem to weight initialization to noise modeling. Its bell-shaped curve is defined entirely by two parameters: the mean μ (location) and the standard deviation σ (spread).

Normal Distribution Bell Curve — N(0, 1) Standard Normal

[Figure: bell curve of the Standard Normal, x-axis marked at −3σ, −2σ, −σ, μ = 0, +σ, +2σ, +3σ.]
μ ± 1σ covers 68.27% of data
μ ± 2σ covers 95.45% of data
μ ± 3σ covers 99.73% of data

The Standard Normal and Z-Score

Any Normal distribution can be converted to the Standard Normal N(0,1) by computing the Z-score. This is the basis of feature standardization (one of the most common preprocessing steps in ML):

Z-Score Standardization
Z = (X − μ) / σ    (transforms any Normal(μ, σ²) to Standard Normal(0,1))

Interpretation of Z:
Z = 0.0 means X is exactly at the mean
Z = +2.0 means X is 2 standard deviations above the mean
Z = -1.5 means X is 1.5 standard deviations below the mean

In scikit-learn: StandardScaler().fit_transform(X) applies this to each feature column, using the mean and standard deviation estimated from the data
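The Z-score transformation is a one-liner in NumPy. The sketch below standardizes a synthetic "height" feature; the data, seed, and variable names are hypothetical and only serve to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature: 1,000 heights in cm, drawn from N(170, 8²)
heights = rng.normal(loc=170.0, scale=8.0, size=1000)

# Z-score: subtract the sample mean, divide by the sample standard deviation
# (ddof=0, the population formula, which is also what StandardScaler uses)
z = (heights - heights.mean()) / heights.std()

print(np.isclose(z.mean(), 0.0, atol=1e-9))   # True
print(np.isclose(z.std(), 1.0))               # True
```

Standardizing changes location and scale only. The shape of the distribution is preserved, so a skewed feature stays skewed after the transformation.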

Covariance and Correlation

When working with multiple features, we need to understand how they relate to each other. Covariance and correlation measure the linear relationship between two random variables. These concepts underpin PCA, feature selection, and multicollinearity detection.

Covariance and Pearson Correlation Coefficient
=== COVARIANCE ===
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[XY] − E[X]·E[Y]
Cov > 0: X and Y tend to increase together (positive relationship)
Cov < 0: As X increases, Y tends to decrease
Cov = 0: No linear relationship (but there may be a non-linear relationship)

=== PEARSON CORRELATION (normalized covariance) ===
ρ(X, Y) = Cov(X, Y) / (σ_X · σ_Y)    (always in [−1, +1])
ρ = +1: Perfect positive linear relationship
ρ = 0: No linear relationship
ρ = −1: Perfect negative linear relationship
Why covariance matters in ML
PCA computes the covariance matrix of features and decomposes it to find the principal components — the directions of maximum variance in the data.
Multicollinearity warning
High correlation (|ρ| near 1) between input features can destabilize linear models. Always check the correlation matrix during EDA before training.
Correlation is not causation
Ice cream sales and drowning rates are correlated (both increase in summer). A model finding this pattern would be useless. Features must be interpreted carefully.
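Zero correlation does not mean "no relationship", and a two-line experiment shows why. The sketch below compares a perfectly linear dependence with a perfectly quadratic one on a symmetric grid of x values; the data are synthetic and the names are illustrative.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric around 0

y_linear = 3.0 * x + 1.0          # exact linear function of x
y_quadratic = x ** 2              # exact, but non-linear, function of x

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"{r_linear:.3f}")          # 1.000
print(abs(r_quadratic) < 1e-9)    # True: correlation is (numerically) zero
```

In the quadratic case y is completely determined by x, yet Pearson correlation reports nothing, because the positive and negative halves of x cancel. This is why a near-zero entry in a correlation matrix is not proof that a feature is uninformative.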

The Central Limit Theorem (CLT)

The Central Limit Theorem is one of the most profound results in all of mathematics. It explains why the Normal distribution appears everywhere in nature and in Machine Learning, even when the underlying data is not Gaussian.

“Given a sufficiently large sample from any population with a finite mean and variance, the distribution of the sample mean will be approximately Normal, regardless of the underlying population distribution.”

— Central Limit Theorem

Central Limit Theorem — Formal Statement
Let X₁, X₂, …, Xₙ be i.i.d. random variables with mean μ and finite variance σ².
X̄ₙ = (X₁ + X₂ + … + Xₙ) / n    (sample mean)
As n → ∞: X̄ₙ ~ N(μ, σ² / n)    (approximately Normal)
The standard error of the mean = σ / √n    (shrinks as n grows)
Why the CLT Is Critical for Machine Learning
1
Justifies Gaussian Assumptions
Many ML algorithms (linear regression, LDA, Gaussian Naive Bayes) assume Gaussian noise or features. The CLT justifies this — real-world measurements are often sums of many small effects, making them approximately Gaussian.
2
Enables Confidence Intervals
When you evaluate a model on a test set, you are computing a sample mean (e.g., average accuracy). The CLT lets you build confidence intervals around that estimate and reason about how it generalizes.
3
Justifies Mini-Batch SGD
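Mini-batch gradient descent uses small random samples to estimate the true gradient. The law of large numbers says this estimate approaches the full-batch gradient as the batch size grows; the CLT adds that the estimation noise is approximately Gaussian, with a standard deviation that shrinks in proportion to 1/√(batch size).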
4
Foundation of Hypothesis Testing
A/B testing for model comparison, t-tests for feature significance, and ANOVA for hyperparameter analysis all rely on the CLT to determine whether observed differences are statistically significant.
Python — Central Limit Theorem Demonstration
import numpy as np
from scipy import stats

np.random.seed(42)

# The underlying population: Exponential (highly skewed, NOT Gaussian)
population = np.random.exponential(scale=2.0, size=100_000)
# For Exponential(scale=2) both the mean and the skewness are 2
print(f"Population mean: {population.mean():.2f}, skewness: {stats.skew(population):.2f}")  # both ≈ 2.0

# Draw 10,000 samples of size n, compute each sample mean
for n in [2, 10, 30, 100]:
    sample_means = [
        np.mean(np.random.choice(population, n))
        for _ in range(10_000)
    ]
    sm = np.array(sample_means)
    print(f"n={n:3d}: mean={sm.mean():.3f}, std={sm.std():.3f}, (theory σ/√n={2.0/np.sqrt(n):.3f})")
# → n=  2: mean=1.998, std=1.412, (theory σ/√n=1.414)
# → n= 10: mean=2.003, std=0.633, (theory σ/√n=0.632)
# → n= 30: mean=1.998, std=0.364, (theory σ/√n=0.365)
# → n=100: mean=2.001, std=0.200, (theory σ/√n=0.200)
# The spread of the sample means tracks σ/√n, and their distribution
# looks increasingly Normal as n grows, even for the skewed Exponential.

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation is the most widely used method for fitting a probability distribution to data. The core idea: given observed data, choose the distribution parameters that make the observed data most probable. MLE is the theoretical foundation behind training logistic regression, neural networks (with cross-entropy loss), and most other supervised ML algorithms.

MLE — The Core Idea

Observed data X → Write likelihood L(θ|X) → Take log (log-likelihood) → Maximize: dℓ/dθ = 0 → θ̂_MLE estimate
MLE — General Formulation
=== LIKELIHOOD FUNCTION ===
L(θ | X) = Πᵢ P(xᵢ | θ)    (product over all n observations; assumes i.i.d. data)

=== LOG-LIKELIHOOD (more numerically stable) ===
ℓ(θ | X) = log L(θ | X) = Σᵢ log P(xᵢ | θ)

=== MLE ESTIMATOR ===
θ̂_MLE = argmax over θ of ℓ(θ | X)    (find θ that maximizes the log-likelihood)

=== EXAMPLE: MLE for Gaussian N(μ, σ²) ===
μ̂_MLE = (1/n) × Σᵢ xᵢ    (sample mean)
σ̂²_MLE = (1/n) × Σᵢ (xᵢ − μ̂)²    (sample variance, biased)

MLE and cross-entropy loss are the same thing. When you train a classifier with cross-entropy loss, you are performing Maximum Likelihood Estimation of the parameters θ (weights) under a categorical (Bernoulli for binary) distribution. Minimizing cross-entropy = maximizing the log-likelihood. This is why MLE is not just academic theory — it is the engine running inside every neural network you train.
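The equivalence between cross-entropy and negative log-likelihood can be checked numerically for the binary case. The labels and predicted probabilities below are made-up values standing in for the outputs of some classifier; nothing here depends on a particular model.

```python
import numpy as np

# Hypothetical labels and predicted P(y = 1) from some binary classifier
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Binary cross-entropy, as usually written for a loss function
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Bernoulli likelihood of the observed labels: p where y = 1, (1 - p) where y = 0
likelihood = np.prod(np.where(y == 1, p, 1 - p))

# Average negative log-likelihood
nll = -np.log(likelihood) / len(y)

print(np.isclose(bce, nll))   # True: the two quantities are identical
```

In practice the log-likelihood is accumulated as a sum of logs rather than the log of a product, because the raw product of many probabilities underflows to zero very quickly.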

Python — MLE for Gaussian and Bayes’ Theorem
import numpy as np
from scipy import stats

# ── MLE FOR GAUSSIAN ───────────────────────────────────────
np.random.seed(0)
true_mu, true_sigma = 5.0, 2.0
X = np.random.normal(true_mu, true_sigma, size=1000)

# MLE estimates = sample mean and std
mu_mle    = X.mean()
sigma_mle = X.std()  # ddof=0 is MLE (biased); ddof=1 is the usual bias-corrected version
print(f"True μ={true_mu}, MLE μ̂={mu_mle:.3f}")
print(f"True σ={true_sigma}, MLE σ̂={sigma_mle:.3f}")
# → both estimates land close to the true values (within sampling error for n=1000)

# ── BAYES THEOREM: COIN BIAS INFERENCE ─────────────────────
# Question: is this coin fair? We observe 7 heads in 10 flips
n_flips, n_heads = 10, 7

# Grid of possible bias values p ∈ [0, 1]
p_grid   = np.linspace(0, 1, 1000)

# Prior: flat Uniform prior (no prior knowledge)
prior    = np.ones(len(p_grid))

# Likelihood: Binomial, P(7 heads | p)
likelihood = stats.binom.pmf(n_heads, n_flips, p_grid)

# Posterior (unnormalized) = likelihood * prior
posterior      = likelihood * prior
posterior_norm = posterior / posterior.sum()  # normalize over the grid

# MAP estimate = p with highest posterior probability
p_map = p_grid[np.argmax(posterior_norm)]
print(f"MAP estimate of coin bias p = {p_map:.3f}")
# → MAP estimate of coin bias p = 0.700  (= MLE: 7/10, because the prior is flat)

Complete Distributions Reference for Machine Learning

The following table serves as a quick-reference guide mapping each distribution to its parameters, summary statistics, and primary ML applications. Bookmark this table.

Distribution | Type | Parameters | Mean | Variance | Key ML Use
Bernoulli | Discrete | p ∈ [0,1] | p | p(1−p) | Binary classification, logistic regression
Binomial | Discrete | n, p | np | np(1−p) | A/B testing, bootstrapping, count outcomes
Multinomial | Discrete | n, p₁…p_K | npᵢ | npᵢ(1−pᵢ) | Multi-class classification, NLP word counts
Poisson | Discrete | λ > 0 | λ | λ | Count regression, rare event modeling
Geometric | Discrete | p ∈ (0,1] | 1/p | (1−p)/p² | Modeling number of trials until first success
Uniform | Continuous | a, b | (a+b)/2 | (b−a)²/12 | Random initialization, random search
Normal (Gaussian) | Continuous | μ, σ² | μ | σ² | Noise model, weight init, VAE latent space
Exponential | Continuous | λ > 0 | 1/λ | 1/λ² | Time-to-event, survival analysis
Beta | Continuous | α, β > 0 | α/(α+β) | αβ/((α+β)²(α+β+1)) | Bayesian prior for probabilities, click rates
Gamma | Continuous | k, θ > 0 | kθ | kθ² | Prior for rates, positive-valued outcomes
Dirichlet | Continuous | α₁…α_K | αᵢ/α₀, with α₀ = ∑αᵢ | αᵢ(α₀−αᵢ)/(α₀²(α₀+1)) | Prior over categorical distributions, latent Dirichlet allocation (LDA) topic models
Student’s t | Continuous | ν (degrees of freedom) | 0 (ν>1) | ν/(ν−2) (ν>2) | Small-sample inference, t-SNE, robust models
Chi-Squared | Continuous | k (degrees of freedom) | k | 2k | Goodness-of-fit tests, feature selection (χ² test)
Multivariate Gaussian | Multivariate | μ vector, Σ matrix | μ | Σ | GMMs, Gaussian processes, linear discriminant analysis (LDA), Kalman filters

Working with Distributions in Python

The scipy.stats module provides a consistent API for all common distributions. Every distribution object supports the same four core methods:

.pmf(x) / .pdf(x)
P(X = x) or f(x)
Point probability or density at x
.cdf(x)
P(X ≤ x)
Cumulative probability up to x
.ppf(q)
Quantile Function
Inverse CDF — x such that P(X≤x)=q
.rvs(size)
Random Samples
Draw random samples from the distribution
Python — Continuous Distributions: Normal, Beta, Exponential
import numpy as np
from scipy import stats
from scipy.stats import multivariate_normal

# ── NORMAL DISTRIBUTION ────────────────────────────────────
norm = stats.norm(loc=0, scale=1)  # Standard Normal N(0,1)
print(f"P(Z < 0)        = {norm.cdf(0):.4f}")    # → 0.5000
print(f"P(-1 < Z < 1)  = {norm.cdf(1) - norm.cdf(-1):.4f}")  # → 0.6827
print(f"P(-2 < Z < 2)  = {norm.cdf(2) - norm.cdf(-2):.4f}")  # → 0.9545
print(f"95th percentile = {norm.ppf(0.95):.4f}")  # → 1.6449
samples = norm.rvs(size=1000)          # draw 1000 samples

# ── BETA DISTRIBUTION ──────────────────────────────────────
# Use case: modeling click-through rate (CTR) for a web button
# Prior belief: 10 clicks out of 100 impressions observed
alpha, beta_param = 10, 90  # successes=10, failures=90
beta_dist = stats.beta(alpha, beta_param)
print(f"Beta({alpha},{beta_param}): Mean={beta_dist.mean():.3f}")
print(f"95% credible interval for CTR: [{beta_dist.ppf(0.025):.3f}, {beta_dist.ppf(0.975):.3f}]")
# → Mean=0.100 | interval of roughly [0.05, 0.17]

# ── FITTING A DISTRIBUTION TO DATA (MLE via scipy) ─────────
data = np.random.exponential(scale=3.0, size=500)
# scipy's fit() performs MLE; floc=0 pins the location so only the scale is estimated
loc_fit, scale_fit = stats.expon.fit(data, floc=0)
print(f"MLE scale (1/λ) = {scale_fit:.3f}")  # → ~3.0

# ── MULTIVARIATE NORMAL ────────────────────────────────────
mu    = [0, 0]
Sigma = [[1, 0.8], [0.8, 1]]  # covariance matrix (correlated features)
mvn   = multivariate_normal(mean=mu, cov=Sigma)
x2d   = mvn.rvs(size=1000)            # 2D samples; used in GMM, PCA
print(f"Sample correlation: {np.corrcoef(x2d[:,0], x2d[:,1])[0,1]:.3f}")  # → ~0.800

How These Distributions Show Up Inside ML Algorithms

The following table maps specific ML algorithms and components to the probability distributions they explicitly use or implicitly assume. This is the bridge between probability theory and practical ML.

ML Algorithm / Component | Distribution Used | How It Is Used
Logistic Regression | Bernoulli | Output is P(Y=1|X). Loss = negative log-likelihood of Bernoulli = binary cross-entropy
Softmax / Multi-class | Multinomial (categorical) | Output layer gives P(Y=k|X) for each class. Loss = categorical cross-entropy
Naive Bayes Classifier | Gaussian / Multinomial | Assumes each feature follows a Gaussian (Gaussian NB) or Multinomial (text NB) distribution
Linear Regression (OLS) | Normal | Assumes residuals εᵢ ~ N(0, σ²). MSE minimization is MLE under this assumption
Ridge Regression (L2) | Normal (Prior) | Equivalent to placing a zero-mean Gaussian prior N(0, τ²) on weights (MAP estimation); the penalty strength λ is proportional to 1/τ²
Lasso Regression (L1) | Laplace (Prior) | Equivalent to placing a Laplace prior on weights = MAP estimation with sparsity
Gaussian Mixture Models | Multivariate Normal | Data is modeled as a mixture of K Gaussian components; the EM algorithm estimates parameters
Weight Initialization (He/Xavier) | Normal / Uniform | He: weights drawn from N(0, 2/n_in). Xavier (uniform): weights drawn from Uniform[−√(6/(n_in+n_out)), +√(6/(n_in+n_out))]. Both aim to maintain gradient flow
Variational Autoencoder (VAE) | Normal | Latent prior z ~ N(0,I). The ELBO loss contains the KL divergence between the approximate posterior and this prior
Dropout Regularization | Bernoulli | Each unit is dropped independently with probability p (a Bernoulli draw) during training
t-SNE | Normal + Student-t | High-dim similarities use a Gaussian kernel; low-dim similarities use the heavier-tailed t distribution with 1 degree of freedom
LDA Topic Modeling | Dirichlet + Multinomial | Per-document topic distributions θ_d ~ Dirichlet(α); per-topic word distributions ϕ_k ~ Dirichlet(β)
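The ridge row of the table can be checked numerically. The sketch below assumes a linear model with Gaussian noise of standard deviation `sigma` and a zero-mean Gaussian prior with standard deviation `tau` on each weight; the data are synthetic and every name is illustrative. Under those assumptions the MAP estimate should coincide with the ridge solution for penalty `lam = sigma**2 / tau**2`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))             # hypothetical design matrix
true_w = np.array([1.5, -2.0, 0.5])      # hypothetical true weights
sigma, tau = 0.5, 2.0                    # assumed noise std and prior std
y = X @ true_w + rng.normal(scale=sigma, size=50)

# Closed-form ridge solution for the penalty lam * ||w||²
lam = sigma ** 2 / tau ** 2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

def neg_log_posterior_grad(w):
    """Gradient of ||y - Xw||² / (2σ²) + ||w||² / (2τ²) with respect to w."""
    return -X.T @ (y - X @ w) / sigma ** 2 + w / tau ** 2

# At the MAP estimate the gradient of the negative log-posterior vanishes
grad_at_ridge = neg_log_posterior_grad(w_ridge)
print(np.allclose(grad_at_ridge, 0.0, atol=1e-8))   # True
```

A tighter prior (smaller `tau`) corresponds to a larger penalty and pulls the weights harder toward zero; a very wide prior recovers ordinary least squares.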

Conjugate Priors — Closed-Form Bayesian Updates

A conjugate prior is a prior distribution such that the posterior (after observing data) has the same parametric form as the prior — only with updated parameters. This gives us closed-form Bayesian updates without integration, which is computationally invaluable.

Beta-Bernoulli
Prior: Beta(α, β)
Likelihood: Bernoulli(p)
Posterior: Beta(α + successes, β + failures)

Start with Beta(1,1) = Uniform prior. After observing 7 heads and 3 tails, posterior = Beta(8, 4). The Beta parameters simply count the outcomes.
Gamma-Poisson
Prior: Gamma(α, β), with shape α and rate β
Likelihood: Poisson(λ)
Posterior: Gamma(α + ∑xᵢ, β + n)

Used in modeling event rates λ. After observing n intervals with ∑xᵢ total events, update both shape and rate parameters.
Normal-Normal
Prior: N(μ₀, σ₀²)
Likelihood: N(μ, σ²) (known σ²)
Posterior: Normal with updated mean and variance

Used in Kalman filters, Gaussian processes, and Bayesian linear regression. The posterior mean is a precision-weighted average of prior and data.
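The Beta-Bernoulli update described above needs no integration, only addition. The sketch below reproduces the Beta(1,1) to Beta(8,4) example; the helper name `beta_bernoulli_update` is hypothetical.

```python
def beta_bernoulli_update(alpha, beta, successes, failures):
    """Conjugate update: add observed counts to the prior's parameters."""
    return alpha + successes, beta + failures

# Uniform prior Beta(1, 1), then observe 7 heads and 3 tails
alpha_post, beta_post = beta_bernoulli_update(1, 1, successes=7, failures=3)

# Standard Beta summaries (the mode formula requires alpha, beta > 1)
posterior_mean = alpha_post / (alpha_post + beta_post)
posterior_mode = (alpha_post - 1) / (alpha_post + beta_post - 2)

print(alpha_post, beta_post)        # 8 4
print(round(posterior_mean, 3))     # 0.667
print(round(posterior_mode, 3))     # 0.7
```

With a uniform prior the posterior mode equals the MLE of 7/10, while the posterior mean is pulled slightly toward 0.5 by the prior's one pseudo-success and one pseudo-failure.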

Complete Practical Example: Gaussian Naive Bayes Classifier on Iris

Let us tie everything together: conditional probability, Bayes’ theorem, and the Gaussian distribution all combine inside the Gaussian Naive Bayes classifier. Here is the full probabilistic reasoning chain with annotated code.

Gaussian Naive Bayes — Probabilistic Chain

Prior P(class) × Likelihood P(X|class) = N(μ, σ²) → Unnormalized Posterior → Normalize (/ P(X)) → Predicted Class
Python — Naive Bayes from Scratch Using Probability Theory
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class GaussianNaiveBayesScratch:
    """Gaussian NB: Each feature|class ~ Normal(μ, σ²)"""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = {}   # P(class), from data frequency
        self.means = {}    # μ per feature per class (MLE estimate)
        self.stds = {}     # σ per feature per class (MLE estimate)
        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / len(X)    # P(class)
            self.means[c] = X_c.mean(axis=0)      # MLE μ per feature
            # MLE σ per feature; the tiny constant guards against σ = 0
            self.stds[c] = X_c.std(axis=0) + 1e-9
        return self

    def predict(self, X):
        preds = []
        for x in X:
            log_posteriors = {}
            for c in self.classes:
                # log P(class): log prior
                log_prior = np.log(self.priors[c])
                # sum of log N(x_i | μ_ic, σ_ic): log likelihood under the naive assumption
                log_like = np.sum(
                    stats.norm(self.means[c], self.stds[c]).logpdf(x)
                )
                # Bayes' theorem in log-space; P(X) is the same for all classes, so it is dropped
                log_posteriors[c] = log_prior + log_like
            preds.append(max(log_posteriors, key=log_posteriors.get))
        return np.array(preds)


# ── TEST ON IRIS DATASET ────────────────────────────────────
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gnb = GaussianNaiveBayesScratch()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Accuracy from scratch: {accuracy_score(y_test, y_pred):.2%}")
# → Accuracy from scratch: 96.67%

What just happened? We built a complete classifier from scratch using only Bayes’ theorem and the Gaussian PDF. The fit() method performs MLE to estimate μ and σ for each feature conditioned on each class. The predict() method applies Bayes’ theorem in log-space (for numerical stability) to find the class with the highest posterior. Result: 96.67% accuracy — from probability theory alone.

Key Takeaways

  • Probability theory is the mathematical language of Machine Learning. Every prediction, every loss function, and every training algorithm is rooted in probability.
  • The three Kolmogorov axioms — non-negativity, normalization, and additivity — are the foundation from which all probability rules are derived.
  • Conditional probability P(A|B) restricts the sample space to B. Bayes’ theorem flips this: it lets us compute P(θ|data) from P(data|θ), which is how probabilistic models are trained.
  • Discrete distributions (Bernoulli, Binomial, Poisson) model count and binary outcomes. Continuous distributions (Normal, Beta, Exponential) model measurements and densities.
  • The Normal distribution is special: it appears as noise in regression, as weight initializations in neural networks, as the target in VAE latent spaces, and is often justified by the Central Limit Theorem.
  • MLE (Maximum Likelihood Estimation) is the principled method for fitting distributions to data. Minimizing cross-entropy loss is exactly MLE for categorical distributions, which is what a classifier trained with cross-entropy loss is doing.
  • Covariance and correlation measure linear relationships between features and are central to PCA, feature selection, and detecting multicollinearity.
  • The Central Limit Theorem says that sample means of i.i.d. data become approximately Normal as the sample size grows, whatever the underlying distribution, provided its variance is finite. This underpins confidence intervals, t-tests, and the noise properties of mini-batch gradient descent.

What’s Next?

In Chapter 2.4 — Descriptive and Inferential Statistics, we will build on this foundation by studying how to summarize datasets with measures of central tendency and spread, how to draw inferences from samples using hypothesis testing and confidence intervals, and how statistical significance connects to model evaluation in Machine Learning.