Types of Machine Learning Systems

Not all machine learning systems learn the same way. They differ by how they are trained (supervised, unsupervised, reinforcement), how they consume data (batch vs. online), and how they generalise (instance-based vs. model-based). Understanding these axes is the foundation for choosing the right algorithm for any real-world problem.

Three Ways to Classify Any ML System

Machine learning systems can be classified along three independent axes. Each axis asks a different question about how the system works.

Axis 1 — Training Signal
How does the model receive feedback? With labels, without labels, or through rewards?
Supervised / Unsupervised / RL
Axis 2 — Learning Mode
Does the model learn all at once from stored data, or continuously as new data arrives?
Batch vs. Online Learning
Axis 3 — Generalisation
Does the model memorise training points, or build an internal mathematical model of the world?
Instance-Based vs. Model-Based

Important: These axes are independent. A single ML system can be supervised AND batch AND model-based. Or online AND unsupervised AND instance-based. You classify a system on all three axes simultaneously.


Axis 1: Classification by Training Signal

The most fundamental way to classify ML systems is by the type of feedback the algorithm uses to learn. There are four major paradigms here: Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, and Reinforcement Learning.

1. Supervised Learning
Learn from labelled input-output pairs

In supervised learning, every training example comes with a label — the correct answer the algorithm should produce. The model learns a mapping from inputs to outputs by minimising the difference between its predictions and the known labels. It is called "supervised" because the labels act like a teacher guiding the learning process.

Labelled Data (input features + correct output labels paired together) → Training (the model learns the mapping by minimising prediction error) → Learned Model (a mathematical function from inputs to outputs) → Prediction (apply to unseen data)

Two Sub-Types of Supervised Learning

Classification
The output label is a discrete category. The model learns to assign inputs to one of a fixed set of classes. When there are two classes, it is binary classification; with more, it is multi-class classification.
Examples: spam vs. not spam, cat vs. dog vs. bird, handwritten digit recognition (0-9), disease diagnosis (malignant vs. benign).
Regression
The output is a continuous numerical value. The model learns to predict a number rather than a category. The distance between the prediction and the true value is meaningful.
Examples: house price prediction, stock price forecasting, predicting a patient's blood pressure, estimating a car's fuel efficiency from its specs.

Quick rule: If the answer is a category (yes/no, A/B/C), use classification. If the answer is a number on a continuous scale, use regression. Some algorithms like decision trees and random forests can handle both.

Common Supervised Learning Algorithms

Linear / Logistic Regression
Regression and binary classification respectively. Fast, interpretable, great baselines.
Support Vector Machines (SVM)
Finds the optimal hyperplane maximising the margin between classes. Powerful in high-dimensional spaces.
Decision Trees
Rule-based splits on features. Highly interpretable. Foundation for ensemble methods.
Random Forests
Ensemble of decision trees trained on random subsets. Robust, handles overfitting well.
Neural Networks / Deep Learning
Multi-layered architectures. State-of-the-art for images, text, audio, and video.
K-Nearest Neighbors (KNN)
Classifies a point based on the labels of its K nearest training examples. Simple but powerful.

Supervised Learning: Code Example

Here are two complete examples, one classification and one regression, using scikit-learn's built-in datasets:

Classification — Iris Flower Dataset

Python — Supervised Classification · SVM · Iris Dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load labelled dataset — 150 flowers, 4 features, 3 species
iris = load_iris()
X, y = iris.data, iris.target   # labels: 0=setosa, 1=versicolor, 2=virginica

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a Support Vector Classifier on labelled data
clf = SVC(kernel='rbf', C=1.0)
clf.fit(X_train, y_train)

# Evaluate on unseen test data
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# precision  recall  f1-score  support
# setosa       1.00    1.00      1.00       10
# versicolor   1.00    1.00      1.00        9
# virginica    1.00    1.00      1.00       11

Regression — California Housing Dataset

Python — Supervised Regression · Random Forest · Housing Prices
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Target: median house price (a continuous number — regression task)
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.4f}")        # → RMSE: 0.5035 (units: $100k)
print(f"R²:   {r2_score(y_test, y_pred):.4f}")  # → R²: 0.8058 (explains 80% of variance)

2. Unsupervised Learning
Discover hidden structure in unlabelled data

In unsupervised learning, the training data has no labels. The algorithm must find patterns, structure, or compressed representations entirely on its own — without any teacher telling it what the right answer is. It is one of the most intellectually rich areas of machine learning because the algorithm must define what "interesting structure" even means.

Real-world motivation: Labelling data is expensive, time-consuming, and requires domain expertise. Most of the world's data is unlabelled. Unsupervised learning lets you extract value from raw, untagged data at scale.

The Three Sub-Problems of Unsupervised Learning

Clustering
Partition data points into groups (clusters) such that points within a group are more similar to each other than to points in other groups. The number of groups may or may not be known in advance.
Algorithms: K-Means, DBSCAN, Hierarchical Clustering, Gaussian Mixture Models. Use cases: customer segmentation, document grouping, gene expression analysis.
Dimensionality Reduction
Compress data from a high-dimensional space into a lower-dimensional representation while preserving the most important structure. Also called representation learning.
Algorithms: PCA, t-SNE, UMAP, Autoencoders. Use cases: visualising high-dimensional data, speeding up other ML models, noise reduction.
Association Rule Learning
Discover interesting relationships (rules) between variables in large datasets. Often applied to transactional data to find items that frequently co-occur.
Algorithms: Apriori, FP-Growth, Eclat. Use cases: market basket analysis ("customers who buy bread also buy butter"), recommendation engines, web usage mining.
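Dedicated libraries implement Apriori and FP-Growth, but the core idea — counting items that co-occur in the same basket — fits in a few lines of plain Python. Below is a minimal sketch with invented transaction data, computing the support and confidence of the strongest pair rule:

```python
from itertools import combinations
from collections import Counter

# Invented transaction data (market baskets) for illustration
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

# Count how often each pair of items appears together in a basket
pair_counts = Counter(
    pair
    for basket in transactions
    for pair in combinations(sorted(basket), 2)
)

# Support and confidence for the most frequent rule a -> b
(a, b), count = pair_counts.most_common(1)[0]
support = count / len(transactions)                               # P(a and b)
confidence = count / sum(a in basket for basket in transactions)  # P(b | a)
print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
# → bread -> butter: support=0.60, confidence=0.75
```

Real implementations prune the search space (that is Apriori's contribution); this sketch only shows the counting that underlies the rules.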

Unsupervised Learning: Code Example

K-Means clustering on the Iris dataset — without using any labels:

Python — Unsupervised Clustering · K-Means · Iris Dataset
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Load data — but intentionally DROP the labels
# The algorithm sees ONLY the features, never y
iris = load_iris()
X = iris.data    # shape: (150, 4) — no labels!

# Tell K-Means to find 3 natural groupings in the data
kmeans = KMeans(n_clusters=3, random_state=42, n_init='auto')
cluster_labels = kmeans.fit_predict(X)

# Compare clusters found vs true species labels
ari = adjusted_rand_score(iris.target, cluster_labels)
print(f"Adjusted Rand Index: {ari:.3f}")  # → 0.730 (without seeing any labels!)

# Cluster centres in feature space
print("Cluster centres:")
print(kmeans.cluster_centers_.round(2))

What happened? K-Means received 150 data points with 4 features each and absolutely no class labels. It discovered 3 natural groupings that align with an Adjusted Rand Index of 0.730 against the true species — purely from the geometric structure of the data.

Dimensionality Reduction — PCA on High-Dimensional Data

Python — Unsupervised Dimensionality Reduction · PCA · Digits Dataset
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Handwritten digits: 1797 images, each is 8x8 = 64 features
digits = load_digits()
X = digits.data  # shape: (1797, 64)
print(f"Original shape: {X.shape}")  # (1797, 64)

# Reduce 64 dimensions → 2 for visualisation
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Reduced shape:  {X_reduced.shape}")  # (1797, 2)

# How much variance do 2 components explain?
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
# → Variance explained: 28.5%  (64 dims compressed to 2)

# Use 30 components to retain 93% of variance
pca_30 = PCA(n_components=30)
X_comp = pca_30.fit_transform(X)
print(f"Variance explained: {pca_30.explained_variance_ratio_.sum():.1%}")
# → Variance explained: 93.8%  (53% fewer features, 93% info retained)

3. Semi-Supervised Learning
A small amount of labelled data + a large amount of unlabelled data

Semi-supervised learning sits between supervised and unsupervised learning. It uses a small pool of labelled examples to guide the learning process, combined with a large amount of unlabelled data to capture the underlying data distribution. This mirrors how humans learn — we get a few explicit lessons and infer a great deal from observation.

The Supervised–Unsupervised Spectrum

100% Labelled → Supervised
Few Labels + Many Unlabelled → Semi-Supervised
0% Labelled → Unsupervised

In practice, most real-world datasets fall somewhere in the middle of this spectrum.

Classic Real-World Example: Google Photos

Google Photos must identify faces across billions of images, and it is impractical to label every photo manually. The system first clusters faces without supervision, then asks the user to label one or two clusters (e.g., "this is Mum"). The labels are then propagated across all similar unlabelled faces — semi-supervised learning in action.

Common semi-supervised techniques include self-training (use the model's own confident predictions as pseudo-labels), co-training, label propagation on graph structures, and generative models that model the joint distribution of data and labels.

Python — Semi-Supervised · Label Propagation · Iris Dataset
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y_true = iris.data, iris.target

# Simulate: label only 20 of 150 examples (13%)
# Mark unlabelled examples with -1
y_partial = np.full(y_true.shape, -1)  # -1 = unlabelled
rng = np.random.RandomState(42)
labelled_idx = rng.choice(150, 20, replace=False)
y_partial[labelled_idx] = y_true[labelled_idx]

# Label Propagation spreads labels through the data graph
lp = LabelPropagation(kernel='rbf', max_iter=1000)
lp.fit(X, y_partial)

# Accuracy on ALL 150 examples (including unlabelled ones)
print(f"Accuracy: {accuracy_score(y_true, lp.transduction_):.1%}")
# → Accuracy: 96.0% — using only 20 labelled examples!
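Self-training, the other technique mentioned above, can be sketched with scikit-learn's SelfTrainingClassifier wrapper under the same 20-labels-of-150 Iris setup. The SVC base model and the 0.9 confidence threshold are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y_true = iris.data, iris.target

# Label only 20 of 150 examples; mark the rest with -1 (unlabelled)
y_partial = np.full(y_true.shape, -1)
rng = np.random.RandomState(42)
labelled_idx = rng.choice(150, 20, replace=False)
y_partial[labelled_idx] = y_true[labelled_idx]

# The base classifier must expose predict_proba, hence probability=True
base = SVC(kernel='rbf', probability=True, random_state=42)

# Self-training: repeatedly fit on the labelled pool, then promote the
# model's own confident predictions (probability >= 0.9) to pseudo-labels
st = SelfTrainingClassifier(base, threshold=0.9)
st.fit(X, y_partial)

print(f"Accuracy on all 150 examples: {accuracy_score(y_true, st.predict(X)):.1%}")
```

The same -1 convention for unlabelled examples is shared with LabelPropagation, so the two approaches are easy to compare on identical data.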

4. Reinforcement Learning
Learn optimal behaviour through trial, error, and reward signals

Reinforcement Learning (RL) is fundamentally different from the other paradigms. There is no fixed dataset of examples. Instead, a learning agent interacts with an environment, takes actions, observes the resulting state, and receives a numerical reward signal. The goal is to learn a policy — a strategy for choosing actions — that maximises total cumulative reward over time.

Agent–Environment Interaction Loop

Agent (learns the policy) → Action (a_t) → Environment (chess board, game, robot world)
Environment → Reward (r_t), next State (s_t+1) → Agent

The agent selects an action. The environment transitions to a new state and emits a reward. This loop repeats — the agent learns to maximise total future reward.

Core RL Terminology

Term | Definition | Chess Example
Agent | The learner / decision-maker | The chess-playing AI
Environment | Everything outside the agent that it interacts with | The chess board and opponent
State (s) | A complete description of the world at a given moment | Current positions of all pieces on the board
Action (a) | A decision the agent can make | Moving a piece to a specific square
Reward (r) | A scalar signal indicating how good the last action was | +1 for winning, -1 for losing, 0 for each move
Policy (π) | The agent's strategy: a function from states to actions | The complete set of "given this board position, play this move" rules
Value Function | Expected cumulative future reward from a state | How likely is this board position to lead to a win?

Landmark RL Achievements

DeepMind's AlphaGo became the first program to defeat a world champion Go player in 2016 using deep reinforcement learning. AlphaZero later mastered chess, shogi, and Go from scratch within 24 hours — with no human game data, only the rules of the game and a reward signal of win/lose.

Python — Reinforcement Learning · Q-Learning · Conceptual Skeleton
import numpy as np

# Q-Learning: tabular RL for discrete state/action spaces
# Q[state, action] = expected future reward of taking 'action' from 'state'

n_states  = 16   # 4x4 grid world
n_actions = 4    # up, down, left, right
Q         = np.zeros((n_states, n_actions))  # initialise Q-table

alpha   = 0.1    # learning rate
gamma   = 0.99   # discount factor (how much future rewards matter)
epsilon = 0.1    # exploration rate (random action with prob epsilon)

def q_learning_update(state, action, reward, next_state):
    # Bellman equation: update Q towards the observed return
    best_next_q       = np.max(Q[next_state])
    td_target         = reward + gamma * best_next_q
    td_error          = td_target - Q[state, action]
    Q[state, action] += alpha * td_error  # update Q-value

# The agent interacts with the environment for many episodes.
# env_reset() and env_step() are placeholders for a real environment.
for episode in range(10_000):
    state, done = env_reset(), False   # start fresh each episode
    while not done:
        # Epsilon-greedy: explore randomly or exploit Q-table
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)  # explore
        else:
            action = np.argmax(Q[state])           # exploit
        next_state, reward, done = env_step(action)
        q_learning_update(state, action, reward, next_state)
        state = next_state
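The skeleton leaves env_reset() and env_step() abstract. For a fully runnable illustration, here is the same tabular Q-learning loop on an invented toy environment: a 1-D corridor of five cells where the agent starts at the left end and earns reward +1 for reaching the right end (all names and hyperparameters here are illustrative):

```python
import numpy as np

# Toy environment: a 1-D corridor of 5 cells. The agent starts in
# cell 0 and receives reward +1 for reaching cell 4 (episode ends).
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]   # left, right

def env_reset():
    return 0

def env_step(state, action_idx):
    next_state = int(np.clip(state + ACTIONS[action_idx], 0, N_STATES - 1))
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, len(ACTIONS)))
alpha, gamma, epsilon = 0.1, 0.9, 0.3   # high exploration for this tiny world

for episode in range(500):
    state, done = env_reset(), False
    while not done:
        # Epsilon-greedy action selection
        if rng.random() < epsilon:
            action = int(rng.integers(len(ACTIONS)))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = env_step(state, action)
        # Bellman update, same rule as the skeleton
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

# The learned greedy policy should be "always move right"
print(["right" if np.argmax(Q[s]) == 1 else "left" for s in range(GOAL)])
```

The Q-values converge towards γ^(distance−1): the closer a state is to the goal, the higher its value, which is exactly the "value function" row of the terminology table above.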

Side-by-Side Comparison: All Four Learning Paradigms

Use this reference table to understand the key differences across all four training paradigms before choosing one for a real project.

Paradigm | Training Data | Feedback Signal | Goal | Real-World Use Case
Supervised | Labelled pairs (X, y) | Error between prediction and known label | Learn a mapping from inputs to outputs | Email spam filter, fraud detection, house price prediction
Unsupervised | Unlabelled X only | Internal — reconstruction error, cluster cohesion, etc. | Discover hidden structure, compress data | Customer segmentation, anomaly detection, topic modelling
Semi-Supervised | Few labelled + many unlabelled | Labelled error + unlabelled structure signals | Leverage unlabelled data to improve a labelled model | Medical image analysis, web content classification
Reinforcement | No fixed dataset — agent generates data by acting | Reward signal from environment | Learn a policy that maximises cumulative reward | Game playing, robot navigation, recommendation systems, trading bots

Axis 2: Classification by Learning Mode — Batch vs. Online

The second axis asks: when and how does the model update itself? Does it learn once from a fixed dataset, or does it continuously update as new data streams in?

Batch Learning (Offline Learning)

The model is trained on the entire available dataset at once, producing a fixed model that is deployed without further updates. If new data arrives, the model must be fully retrained from scratch.

  • Requires all training data to be available upfront
  • Training can take hours or days (done offline)
  • Deployed model is static — does not adapt to drift
  • To update, you retrain on the full combined dataset and re-deploy
  • Best when data does not change rapidly over time
  • Computationally expensive at training time, cheap at inference time
Examples: traditional spam filters, fraud detection models trained monthly, product recommendation engines updated weekly.
Online Learning (Incremental Learning)

The model is trained incrementally as new data arrives — either one sample at a time (pure online) or in small batches (mini-batches). The model continuously updates its parameters with each new observation.

  • Does not need to store all historical data in memory
  • Adapts quickly to changes in the data distribution (concept drift)
  • Uses a learning rate to control how fast it adapts
  • A bad data point or corrupted stream can degrade the model fast
  • Also used for datasets too large to fit in memory (out-of-core learning)
  • Suited for real-time or high-velocity data streams
Examples: stock price forecasting, real-time ad click prediction, live sensor data anomaly detection.

Learning Rate in Online Learning: The learning rate parameter controls how quickly the model adapts to new data. A high rate means the model forgets old patterns quickly — useful when data changes fast but risky if the new data is noisy. A low rate means the model is stable but slow to adapt to genuine concept drift.
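The learning-rate trade-off can be seen without a full model. The toy sketch below tracks the mean of a drifting stream with the generic online update `estimate += lr * (x - estimate)`; the stream and the two rates are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented stream: the true mean jumps from 0.0 to 5.0 halfway through
stream = np.concatenate([
    rng.normal(0.0, 1.0, 500),   # before concept drift
    rng.normal(5.0, 1.0, 500),   # after concept drift
])

def online_mean(stream, lr):
    """Track a running estimate of the stream's mean, one sample at a time."""
    estimate = 0.0
    for x in stream:
        estimate += lr * (x - estimate)   # generic online update
    return estimate

# A high rate adapts to the jump quickly (but is noisier sample to sample);
# a very low rate is still dragged down by the pre-drift data
print(f"lr=0.1   → final estimate {online_mean(stream, 0.1):.2f}")    # close to 5.0
print(f"lr=0.001 → final estimate {online_mean(stream, 0.001):.2f}")  # lags well behind
```

The same tension applies to real online learners: the `eta0` parameter of SGD-style models plays the role of `lr` here.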

Batch vs. Online: Code Comparison

Python — Batch vs. Online Learning · SGDClassifier · partial_fit()
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import numpy as np

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

## ─── BATCH LEARNING ─────────────────────────────────────────
batch_clf = SGDClassifier(random_state=42)
batch_clf.fit(X, y)         # learns from ENTIRE dataset at once
print("Batch model trained on all 150 samples at once")

## ─── ONLINE LEARNING ─────────────────────────────────────────
online_clf = SGDClassifier(random_state=42)
all_classes = np.unique(y)

# Simulate a data stream — learn one mini-batch at a time
for start in range(0, len(X), 10):
    X_batch = X[start:start+10]
    y_batch = y[start:start+10]
    # partial_fit() updates model without forgetting previous batches
    online_clf.partial_fit(X_batch, y_batch, classes=all_classes)

print("Online model trained on 15 mini-batches of 10 samples each")

Axis 3: How the System Generalises — Instance-Based vs. Model-Based

The third axis concerns how the algorithm generalises from the training examples it has seen to new, unseen examples. There are two fundamentally different philosophies.

Instance-Based Learning (Memory-Based)

The system memorises the training examples and generalises to new points by comparing them to stored instances using a similarity measure (e.g., Euclidean distance). It does not build an explicit model of the world — it learns by heart and reasons by analogy.

How it predicts: When given a new data point, find the most similar training example(s) and use their labels to make a prediction.

Advantages: Trivially adapts to new training data, no training time, naturally handles complex non-linear decision boundaries.

Disadvantages: Prediction is slow (must compare against all stored points), requires large memory, sensitive to irrelevant features and the choice of similarity measure.

Canonical algorithm: K-Nearest Neighbors (KNN). Also: Locally Weighted Regression, Case-Based Reasoning systems.
Model-Based Learning (Parametric)

The system builds an explicit mathematical model of the data — a compact set of parameters (weights, coefficients) that summarise the patterns learned. After training, the raw data can be discarded. The model is the generalisation.

How it predicts: Apply the learned mathematical function to the new input — a simple computation regardless of the training set size.

Advantages: Fast prediction, compact storage, strong interpretability (for linear models), principled framework for uncertainty.

Disadvantages: Requires choosing the right model family, may underfit if the model is too simple, parameters must be learned through an optimisation process.

Canonical algorithms: Linear Regression, Logistic Regression, Decision Trees, Neural Networks, SVMs.
Instance-Based: KNN

To classify a new point, find the K nearest labelled points and take a majority vote.

[Figure: with K=3, the new point's three nearest neighbours are 2 blue and 1 teal → predict blue]

No training phase. The entire training set IS the model.
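To make "the training set IS the model" concrete, here is a from-scratch K-nearest-neighbours classifier in a few lines of NumPy; the 2-D toy data is invented for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote of its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of k closest
    votes = y_train[nearest]
    return int(np.bincount(votes).argmax())              # majority label

# Toy 2-D data: class 0 clusters near (0, 0), class 1 near (5, 5)
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# "Training" never happened — prediction just scans the stored examples
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # → 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # → 1
```

Note that every prediction touches all stored points, which is exactly why instance-based prediction slows down as the training set grows.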

Model-Based: Linear Regression

Fit a compact mathematical function to all training points. Then use that function for prediction.

Training phase:
Minimise: ∑(yᵢ − (θ₀ + θ₁xᵢ))²
Result: learned parameters
θ₀ = 4.21, θ₁ = 0.37
Prediction (discard training data):
ŷ = 4.21 + 0.37 × x_new

Training data can be discarded. Only the parameters are needed for inference.

Python — Instance-Based (KNN) vs. Model-Based (Linear Regression)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
import numpy as np

# Toy dataset: years of experience → salary
X_train = np.array([[1],[2],[3],[5],[7],[10]])
y_train = np.array([30,35,40,52,63,80])  # salary in $k
X_new   = np.array([[6]])           # 6 years experience (unseen)

## Instance-Based: KNN stores all training data
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train, y_train)        # "training" = memorise data
print(f"KNN prediction for 6 yrs: ${knn.predict(X_new)[0]:.1f}k")
# → KNN prediction: $57.5k  (average of 2 nearest: $52k and $63k)

## Model-Based: Linear Regression fits a parametric model
lr = LinearRegression()
lr.fit(X_train, y_train)         # training = learn parameters θ₀, θ₁
print(f"Linear model: ŷ = {lr.intercept_:.1f} + {lr.coef_[0]:.1f}·x")
# → Linear model: ŷ = 23.9 + 5.6·x
print(f"LR  prediction for 6 yrs: ${lr.predict(X_new)[0]:.1f}k")
# → LR  prediction for 6 yrs: $57.5k

How to Choose: A Practical Decision Guide

In practice, the right type of ML system depends on the nature of your data, the problem constraints, and the operational requirements. Use this guide to narrow your choice.

Situation | Recommended Approach | Reasoning
You have a labelled dataset and a clear input-output mapping to learn | Supervised Learning | Labels provide the learning signal. Choose classification or regression based on output type.
You have abundant data but no labels — or labelling is prohibitively expensive | Unsupervised Learning | Let the algorithm discover structure. Cluster first, then label cluster representatives if needed.
You have a small labelled set (1–10%) but a large unlabelled pool | Semi-Supervised | Use label propagation or self-training to leverage the unlabelled data and boost performance.
You are optimising sequential decisions in a dynamic environment with a reward signal | Reinforcement Learning | No fixed dataset exists. The agent must explore and learn from environmental feedback.
Data arrives as a continuous stream and the distribution may shift over time | Online Learning | Batch learning cannot adapt to concept drift without expensive full retraining.
You need very fast predictions and interpretable parameters at inference time | Model-Based | The trained parameters encode the model compactly. Inference is a simple arithmetic operation.
Your data has a highly irregular, non-parametric structure with no good model family | Instance-Based (KNN) | KNN makes no assumptions about the functional form. It adapts naturally to any decision boundary shape.

Putting It All Together: Classifying a Real ML System

Let us classify a concrete system — a spam filter built with Logistic Regression and retrained monthly — across all three axes simultaneously.

Example: Email Spam Filter — Classified on All Three Axes
1. Training Signal
Supervised Learning. Every email in the training set is labelled as "spam" or "not spam". The model learns from the error between its predictions and these known labels.
2. Learning Mode
Batch Learning. The model is retrained from scratch every month on all accumulated email data. It does not update continuously — it produces a new static model each retraining cycle.
3. Generalisation
Model-Based Learning. Logistic Regression fits a parametric model — a set of feature weights. Training data is discarded after training. Inference is a fast dot product followed by a sigmoid function.

Summary label for this system: Supervised + Batch + Model-Based. This is the most common combination for production ML systems that deal with well-structured, labelled datasets updated on a schedule.

Key Takeaways

  • Every ML system can be classified on three independent axes: training signal, learning mode, and generalisation strategy.
  • Supervised learning requires labelled data and learns a mapping from inputs to outputs — it powers classification and regression.
  • Unsupervised learning finds hidden structure in unlabelled data through clustering, dimensionality reduction, and association rule learning.
  • Semi-supervised learning bridges the gap — it uses a small labelled set plus a large unlabelled pool, dramatically reducing labelling costs.
  • Reinforcement learning is not data-driven in the traditional sense — an agent learns a policy by interacting with an environment and maximising cumulative reward.
  • Batch learning trains once on a full dataset; online learning updates incrementally and can adapt to concept drift in real-time streams.
  • Instance-based models generalise by similarity to stored examples; model-based systems fit a compact parametric function and discard the training data after training.

What is Next?

In Chapter 1.3 — Main Challenges of Machine Learning, we explore the most common failure modes that practitioners encounter in the real world: insufficient training data, poor data quality, overfitting, underfitting, data mismatch, and the train/serve skew problem — all with practical mitigation strategies and code examples.