Types of Machine Learning Systems
Not all machine learning systems learn the same way. They differ by how they are trained (supervised, unsupervised, reinforcement), how they consume data (batch vs. online), and how they generalise (instance-based vs. model-based). Understanding these axes is the foundation for choosing the right algorithm for any real-world problem.
Three Ways to Classify Any ML System
Machine learning systems can be classified along three independent axes. Each axis asks a different question about how the system works.
Important: These axes are independent. A single ML system can be supervised AND batch AND model-based. Or online AND unsupervised AND instance-based. You classify a system on all three axes simultaneously.
The most fundamental way to classify ML systems is by the type of feedback the algorithm uses to learn. There are four major paradigms here: Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, and Reinforcement Learning.
In supervised learning, every training example comes with a label — the correct answer the algorithm should produce. The model learns a mapping from inputs to outputs by minimising the difference between its predictions and the known labels. It is called "supervised" because the labels act like a teacher guiding the learning process.
Two Sub-Types of Supervised Learning
Quick rule: If the answer is a category (yes/no, A/B/C), use classification. If the answer is a number on a continuous scale, use regression. Some algorithms like decision trees and random forests can handle both.
Common Supervised Learning Algorithms
- Linear Regression and Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- k-Nearest Neighbours (KNN)
- Neural Networks
Supervised Learning: Code Example
Here are two complete examples, one for classification and one for regression, using scikit-learn's built-in datasets:
Classification — Iris Flower Dataset
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load labelled dataset — 150 flowers, 4 features, 3 species
iris = load_iris()
X, y = iris.data, iris.target  # labels: 0=setosa, 1=versicolor, 2=virginica

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a Support Vector Classifier on labelled data
clf = SVC(kernel='rbf', C=1.0)
clf.fit(X_train, y_train)

# Evaluate on unseen test data
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
#               precision    recall  f1-score   support
#       setosa       1.00      1.00      1.00        10
#   versicolor       1.00      1.00      1.00         9
#    virginica       1.00      1.00      1.00        11
```
Regression — California Housing Dataset
```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Target: median house price (a continuous number — regression task)
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.4f}")                     # → RMSE: 0.5035 (units: $100k)
print(f"R²: {r2_score(y_test, y_pred):.4f}")   # → R²: 0.8058 (explains 80% of variance)
```
In unsupervised learning, the training data has no labels. The algorithm must find patterns, structure, or compressed representations entirely on its own — without any teacher telling it what the right answer is. It is one of the most intellectually rich areas of machine learning because the algorithm must define what "interesting structure" even means.
Real-world motivation: Labelling data is expensive, time-consuming, and requires domain expertise. Most of the world's data is unlabelled. Unsupervised learning lets you extract value from raw, untagged data at scale.
The Three Sub-Problems of Unsupervised Learning
Unsupervised learning breaks down into three core sub-problems: clustering (grouping similar examples together), dimensionality reduction (compressing many features into fewer while preserving structure), and association rule learning (discovering rules about which attributes tend to occur together).
Unsupervised Learning: Code Example
K-Means clustering on the Iris dataset — without using any labels:
```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Load data — but intentionally DROP the labels
# The algorithm sees ONLY the features, never y
iris = load_iris()
X = iris.data  # shape: (150, 4) — no labels!

# Tell K-Means to find 3 natural groupings in the data
kmeans = KMeans(n_clusters=3, random_state=42, n_init='auto')
cluster_labels = kmeans.fit_predict(X)

# Compare clusters found vs true species labels
ari = adjusted_rand_score(iris.target, cluster_labels)
print(f"Adjusted Rand Index: {ari:.3f}")  # → 0.730 (without seeing any labels!)

# Cluster centres in feature space
print("Cluster centres:")
print(kmeans.cluster_centers_.round(2))
```
What happened? K-Means received 150 data points with 4 features each and absolutely no class labels. It discovered 3 natural groupings that align with an Adjusted Rand Index of 0.730 against the true species — purely from the geometric structure of the data.
Dimensionality Reduction — PCA on High-Dimensional Data
```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Handwritten digits: 1797 images, each is 8x8 = 64 features
digits = load_digits()
X = digits.data  # shape: (1797, 64)
print(f"Original shape: {X.shape}")  # (1797, 64)

# Reduce 64 dimensions → 2 for visualisation
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Reduced shape: {X_reduced.shape}")  # (1797, 2)

# How much variance do 2 components explain?
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
# → Variance explained: 28.5% (64 dims compressed to 2)

# Use 30 components to retain 93% of variance
pca_30 = PCA(n_components=30)
X_comp = pca_30.fit_transform(X)
print(f"Variance explained: {pca_30.explained_variance_ratio_.sum():.1%}")
# → Variance explained: 93.8% (53% fewer features, 93% info retained)
```
Semi-supervised learning sits between supervised and unsupervised learning. It uses a small pool of labelled examples to guide the learning process, combined with a large amount of unlabelled data to capture the underlying data distribution. This mirrors how humans learn — we get a few explicit lessons and infer a great deal from observation.
The Supervised–Unsupervised Spectrum
Supervised (all examples labelled) → Semi-Supervised (few labelled + many unlabelled) → Unsupervised (no labels)
In practice, most real-world datasets fall somewhere in the middle of this spectrum.
Classic Real-World Example: Google Photos
Google Photos must identify faces across billions of images, and it is impractical to label every photo manually. The system first clusters faces without supervision, then asks the user to label one or two clusters (e.g., "this is Mum"). Those labels are then propagated across all similar unlabelled faces — semi-supervised learning in action.
Common semi-supervised techniques include self-training (use the model's own confident predictions as pseudo-labels), co-training, label propagation on graph structures, and generative models that model the joint distribution of data and labels.
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y_true = iris.data, iris.target

# Simulate: label only 20 of 150 examples (13%)
# Mark unlabelled examples with -1
y_partial = np.full(y_true.shape, -1)  # -1 = unlabelled
rng = np.random.RandomState(42)
labelled_idx = rng.choice(150, 20, replace=False)
y_partial[labelled_idx] = y_true[labelled_idx]

# Label Propagation spreads labels through the data graph
lp = LabelPropagation(kernel='rbf', max_iter=1000)
lp.fit(X, y_partial)

# Accuracy on ALL 150 examples (including unlabelled ones)
print(f"Accuracy: {accuracy_score(y_true, lp.transduction_):.1%}")
# → Accuracy: 96.0% — using only 20 labelled examples!
```
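Self-training, mentioned above, can be sketched with scikit-learn's SelfTrainingClassifier. The 20-label split mirrors the example above; the SVC base model and the 0.8 confidence threshold are illustrative choices, not the only options:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y_true = iris.data, iris.target

# Same setup as above: only 20 of 150 examples are labelled (-1 = unlabelled)
y_partial = np.full(y_true.shape, -1)
rng = np.random.RandomState(42)
labelled_idx = rng.choice(150, 20, replace=False)
y_partial[labelled_idx] = y_true[labelled_idx]

# Self-training: the base classifier labels the unlabelled pool with its own
# confident predictions (probability >= threshold), retrains, and iterates
base = SVC(probability=True, random_state=42)  # base model must expose predict_proba
self_training = SelfTrainingClassifier(base, threshold=0.8)
self_training.fit(X, y_partial)

print(f"Accuracy: {accuracy_score(y_true, self_training.predict(X)):.1%}")
```

Compared with label propagation, which spreads labels through a similarity graph, self-training reuses whatever supervised classifier you already trust.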
Reinforcement Learning (RL) is fundamentally different from the other paradigms. There is no fixed dataset of examples. Instead, a learning agent interacts with an environment, takes actions, observes the resulting state, and receives a numerical reward signal. The goal is to learn a policy — a strategy for choosing actions — that maximises total cumulative reward over time.
Agent–Environment Interaction Loop
The agent selects an action. The environment transitions to a new state and emits a reward. This loop repeats — the agent learns to maximise total future reward.
Core RL Terminology
| Term | Definition | Chess Example |
|---|---|---|
| Agent | The learner / decision-maker | The chess-playing AI |
| Environment | Everything outside the agent that it interacts with | The chess board and opponent |
| State (s) | A complete description of the world at a given moment | Current positions of all pieces on the board |
| Action (a) | A decision the agent can make | Moving a piece to a specific square |
| Reward (r) | A scalar signal indicating how good the last action was | +1 for winning, -1 for losing, 0 for each move |
| Policy (π) | The agent's strategy: a function from states to actions | The complete set of "given this board position, play this move" rules |
| Value Function | Expected cumulative future reward from a state | How likely is this board position to lead to a win? |
Landmark RL Achievements
DeepMind's AlphaGo became the first program to defeat a world champion Go player in 2016 using deep reinforcement learning. AlphaZero later mastered chess, shogi, and Go from scratch within 24 hours — with no human game data, only the rules of the game and a reward signal of win/lose.
```python
import numpy as np

# Q-Learning: tabular RL for discrete state/action spaces
# Q[state, action] = expected future reward of taking 'action' from 'state'

n_states = 16   # 4x4 grid world
n_actions = 4   # up, down, left, right
Q = np.zeros((n_states, n_actions))  # initialise Q-table

alpha = 0.1     # learning rate
gamma = 0.99    # discount factor (how much future rewards matter)
epsilon = 0.1   # exploration rate (random action with prob epsilon)

def q_learning_update(state, action, reward, next_state):
    # Bellman equation: update Q towards the observed return
    best_next_q = np.max(Q[next_state])
    td_target = reward + gamma * best_next_q
    td_error = td_target - Q[state, action]
    Q[state, action] += alpha * td_error  # update Q-value

# The agent interacts with the environment for many episodes.
# env_reset() and env_step() are placeholders for your environment's
# interface (e.g. Gymnasium's env.reset() / env.step()).
for episode in range(10_000):
    state = env_reset()  # start fresh each episode
    done = False
    while not done:
        # Epsilon-greedy: explore randomly or exploit Q-table
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)  # explore
        else:
            action = np.argmax(Q[state])           # exploit
        next_state, reward, done = env_step(action)
        q_learning_update(state, action, reward, next_state)
        state = next_state
```
Side-by-Side Comparison: All Four Learning Paradigms
Use this reference table to understand the key differences across all four training paradigms before choosing one for a real project.
| Paradigm | Training Data | Feedback Signal | Goal | Real-World Use Case |
|---|---|---|---|---|
| Supervised | Labelled pairs (X, y) | Error between prediction and known label | Learn a mapping from inputs to outputs | Email spam filter, fraud detection, house price prediction |
| Unsupervised | Unlabelled X only | Internal — reconstruction error, cluster cohesion, etc. | Discover hidden structure, compress data | Customer segmentation, anomaly detection, topic modelling |
| Semi-Supervised | Few labelled + many unlabelled | Labelled error + unlabelled structure signals | Leverage unlabelled data to improve a labelled model | Medical image analysis, web content classification |
| Reinforcement | No fixed dataset — agent generates data by acting | Reward signal from environment | Learn a policy that maximises cumulative reward | Game playing, robot navigation, recommendation systems, trading bots |
The second axis asks: when and how does the model update itself? Does it learn once from a fixed dataset, or does it continuously update as new data streams in?
Batch Learning (Offline Learning)
The model is trained on the entire available dataset at once, producing a fixed model that is deployed without further updates. If new data arrives, the model must be fully retrained from scratch.
- Requires all training data to be available upfront
- Training can take hours or days (done offline)
- Deployed model is static — does not adapt to drift
- To update, you retrain on the full combined dataset and re-deploy
- Best when data does not change rapidly over time
- Computationally expensive at training time, cheap at inference time
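The "retrain on the full combined dataset" workflow can be sketched as follows. This is a minimal illustration: the 100/50 split of the Iris data simulates data available at first deployment plus a batch that arrived later (Iris happens to be sorted by class, so the held-back batch contains an entire class the first model never saw):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_old, y_old = X[:100], y[:100]    # data available at initial training time
X_new, y_new = X[100:], y[100:]    # data that arrived after deployment

# v1: trained once on everything available, then frozen
model_v1 = LogisticRegression(max_iter=1000).fit(X_old, y_old)

# Updating a batch system = retrain on the FULL combined dataset, then redeploy
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])
model_v2 = LogisticRegression(max_iter=1000).fit(X_all, y_all)

# The static v1 model has never seen class 2 and can never predict it
print(f"v1 knows classes: {model_v1.classes_}")  # [0 1]
print(f"v2 knows classes: {model_v2.classes_}")  # [0 1 2]
```

The point is that v1 cannot be patched incrementally: the only supported update path for a batch system is a full retrain followed by a redeploy.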
Online Learning (Incremental Learning)
The model is trained incrementally as new data arrives — either one sample at a time (pure online) or in small batches (mini-batches). The model continuously updates its parameters with each new observation.
- Does not need to store all historical data in memory
- Adapts quickly to changes in the data distribution (concept drift)
- Uses a learning rate to control how fast it adapts
- A bad data point or corrupted stream can degrade the model fast
- Also used for datasets too large to fit in memory (out-of-core learning)
- Suited for real-time or high-velocity data streams
Learning Rate in Online Learning: The learning rate parameter controls how quickly the model adapts to new data. A high rate means the model forgets old patterns quickly — useful when data changes fast but risky if the new data is noisy. A low rate means the model is stable but slow to adapt to genuine concept drift.
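A small experiment illustrates this trade-off. This is a sketch using SGDRegressor with a constant learning rate; the synthetic "drift" from y = 2x to y = -2x and the two eta0 values are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)

# Old pattern: y = 2x.  After concept drift: y = -2x.
X_old = rng.uniform(-1, 1, (500, 1))
y_old = 2 * X_old.ravel()
X_new = rng.uniform(-1, 1, (500, 1))
y_new = -2 * X_new.ravel()

fast = SGDRegressor(learning_rate='constant', eta0=0.5, random_state=0)
slow = SGDRegressor(learning_rate='constant', eta0=0.001, random_state=0)

for model in (fast, slow):
    model.partial_fit(X_old, y_old)  # learn the old pattern from the stream
    model.partial_fit(X_new, y_new)  # a single pass over the drifted stream

# The high-rate learner tracks the new slope (-2); the low-rate one lags behind
print(f"fast (eta0=0.5)   slope after drift: {fast.coef_[0]:+.2f}")
print(f"slow (eta0=0.001) slope after drift: {slow.coef_[0]:+.2f}")
```

If the drifted batch had been corrupted data rather than a genuine distribution shift, the fast learner's quick adaptation would have been exactly the failure mode described above.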
Batch vs. Online: Code Comparison
```python
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import numpy as np

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

## ─── BATCH LEARNING ─────────────────────────────────────────
batch_clf = SGDClassifier(random_state=42)
batch_clf.fit(X, y)  # learns from ENTIRE dataset at once
print("Batch model trained on all 150 samples at once")

## ─── ONLINE LEARNING ────────────────────────────────────────
online_clf = SGDClassifier(random_state=42)
all_classes = np.unique(y)

# Simulate a data stream — learn one mini-batch at a time
for start in range(0, len(X), 10):
    X_batch = X[start:start+10]
    y_batch = y[start:start+10]
    # partial_fit() updates model without forgetting previous batches
    online_clf.partial_fit(X_batch, y_batch, classes=all_classes)

print("Online model trained on 15 mini-batches of 10 samples each")
```
The third axis concerns how the algorithm generalises from the training examples it has seen to new, unseen examples. There are two fundamentally different philosophies.
Instance-Based Learning
The system memorises the training examples and generalises to new points by comparing them to stored instances using a similarity measure (e.g., Euclidean distance). It does not build an explicit model of the world — it learns by heart and reasons by analogy.
How it predicts: When given a new data point, find the most similar training example(s) and use their labels to make a prediction.
Advantages: Trivially adapts to new training data, no training time, naturally handles complex non-linear decision boundaries.
Disadvantages: Prediction is slow (must compare against all stored points), requires large memory, sensitive to irrelevant features and the choice of similarity measure.
Model-Based Learning
The system builds an explicit mathematical model of the data — a compact set of parameters (weights, coefficients) that summarise the patterns learned. After training, the raw data can be discarded. The model is the generalisation.
How it predicts: Apply the learned mathematical function to the new input — a simple computation regardless of the training set size.
Advantages: Fast prediction, compact storage, strong interpretability (for linear models), principled framework for uncertainty.
Disadvantages: Requires choosing the right model family, may underfit if the model is too simple, parameters must be learned through an optimisation process.
Instance-based (e.g., KNN): to classify a new point, find the K nearest labelled points and take a majority vote. There is no training phase; the entire training set IS the model.
Model-based: fit a compact mathematical function to all training points, then use that function for prediction. The training data can be discarded; only the learned parameters are needed for inference.
```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
import numpy as np

# Toy dataset: years of experience → salary
X_train = np.array([[1], [2], [3], [5], [7], [10]])
y_train = np.array([30, 35, 40, 52, 63, 80])  # salary in $k
X_new = np.array([[6]])  # 6 years experience (unseen)

## Instance-Based: KNN stores all training data
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train, y_train)  # "training" = memorise data
print(f"KNN prediction for 6 yrs: ${knn.predict(X_new)[0]:.1f}k")
# → KNN prediction: $57.5k (average of 2 nearest: $52k and $63k)

## Model-Based: Linear Regression fits a parametric model
lr = LinearRegression()
lr.fit(X_train, y_train)  # training = learn parameters θ₀, θ₁
print(f"Linear model: ŷ = {lr.intercept_:.1f} + {lr.coef_[0]:.1f}·x")
# → Linear model: ŷ = 23.9 + 5.6·x
print(f"LR prediction for 6 yrs: ${lr.predict(X_new)[0]:.1f}k")
# → LR prediction for 6 yrs: $57.5k
```
How to Choose: A Practical Decision Guide
In practice, the right type of ML system depends on the nature of your data, the problem constraints, and the operational requirements. Use this guide to narrow your choice.
| Situation | Recommended Approach | Reasoning |
|---|---|---|
| You have a labelled dataset and a clear input-output mapping to learn | Supervised Learning | Labels provide the learning signal. Choose classification or regression based on output type. |
| You have abundant data but no labels — or labelling is prohibitively expensive | Unsupervised Learning | Let the algorithm discover structure. Cluster first, then label cluster representatives if needed. |
| You have a small labelled set (1–10%) but a large unlabelled pool | Semi-Supervised | Use label propagation or self-training to leverage the unlabelled data and boost performance. |
| You are optimising sequential decisions in a dynamic environment with a reward signal | Reinforcement Learning | No fixed dataset exists. The agent must explore and learn from environmental feedback. |
| Data arrives as a continuous stream and the distribution may shift over time | Online Learning | Batch learning cannot adapt to concept drift without expensive full retraining. |
| You need very fast predictions and interpretable parameters at inference time | Model-Based | The trained parameters encode the model compactly. Inference is a simple arithmetic operation. |
| Your data has a highly irregular, non-parametric structure with no good model family | Instance-Based (KNN) | KNN makes no assumptions about the functional form. It adapts naturally to any decision boundary shape. |
Putting It All Together: Classifying a Real ML System
Let us classify a concrete system — a spam filter built with Logistic Regression and retrained monthly — across all three axes simultaneously.
Summary label for this system: Supervised + Batch + Model-Based. The labelled spam/not-spam examples make it supervised; the monthly full retrain makes it batch; the Logistic Regression coefficients that summarise the data make it model-based. This is the most common combination for production ML systems that deal with well-structured, labelled datasets updated on a schedule.
Key Takeaways
- Every ML system can be classified on three independent axes: training signal, learning mode, and generalisation strategy.
- Supervised learning requires labelled data and learns a mapping from inputs to outputs — it powers classification and regression.
- Unsupervised learning finds hidden structure in unlabelled data through clustering, dimensionality reduction, and association rule learning.
- Semi-supervised learning bridges the gap — it uses a small labelled set plus a large unlabelled pool, dramatically reducing labelling costs.
- Reinforcement learning is not data-driven in the traditional sense — an agent learns a policy by interacting with an environment and maximising cumulative reward.
- Batch learning trains once on a full dataset; online learning updates incrementally and can adapt to concept drift in real-time streams.
- Instance-based models generalise by similarity to stored examples; model-based systems fit a compact parametric function and discard the training data after training.
What is Next?
In Chapter 1.3 — Main Challenges of Machine Learning, we explore the most common failure modes that practitioners encounter in the real world: insufficient training data, poor data quality, overfitting, underfitting, data mismatch, and the train/serve skew problem — all with practical mitigation strategies and code examples.