Testing and Validation in Machine Learning
Training a machine learning model is only the beginning. The real question is: will it work on data it has never seen before? Testing and validation are the scientific processes by which we answer that question honestly — without fooling ourselves. A model that performs brilliantly during training but fails in production is worthless. This chapter teaches you how to evaluate models correctly, avoid the most dangerous pitfalls, and select the best model with confidence.
Why Proper Testing and Validation Matter
Consider a student who studies for an exam using only the exact questions that will appear on the test. They score 100%. But when they sit a different exam on the same subject, they fail. This is the core problem ML practitioners call overfitting, and it is the single most common reason deployed models underperform in production.
Proper testing and validation give you a reliable, unbiased estimate of how a model will perform on fresh, real-world data — before you ever deploy it.
Generalisation: The Core Goal
The ultimate goal of any supervised learning model is generalisation — the ability to make accurate predictions on previously unseen data drawn from the same distribution as the training data. A model can fail to generalise in two opposite ways: by being too simple (underfitting) or too complex (overfitting).
The generalisation gap is the difference between training error and test error. A large gap is the primary signal of overfitting. Your job during validation is to measure this gap and close it.
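The gap is trivial to measure once you have a held-out split. Here is a minimal sketch using an unconstrained decision tree (chosen deliberately because it tends to memorise its training data) on the same breast-cancer dataset used throughout this chapter:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# An unconstrained tree can memorise the training set almost perfectly
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # typically at or near 1.0
test_acc = model.score(X_test, y_test)     # noticeably lower
gap = train_acc - test_acc                 # the generalisation gap

print(f"Train accuracy:     {train_acc:.3f}")
print(f"Test accuracy:      {test_acc:.3f}")
print(f"Generalisation gap: {gap:.3f}")
```

A near-perfect training score paired with a visibly lower test score is the overfitting signature described above.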
The Train / Validation / Test Split
The foundational technique for honest model evaluation is partitioning your dataset into separate subsets, each serving a distinct and non-overlapping purpose. Never use the same data for training and for evaluation.
Typical 60 / 20 / 20 Three-Way Split
Common alternative splits: 70/15/15 or 80/10/10. The right ratio depends on dataset size.
Critical rule: If you evaluate your model on the test set and then make changes based on those results, the test set is no longer a "fresh" evaluation. It has effectively become part of your model development loop, and your final accuracy figures will be optimistically biased. The test set must be locked away until the very end.
The validation set exists precisely so you have somewhere safe to iterate: compare models, tune hyperparameters, and make design decisions against the validation score, and spend the test set only once.
Implementing a Train/Test Split with scikit-learn:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# Step 1: Reserve 20% as the untouchable test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Step 2: Split the remaining 80% into 75% train, 25% validation
# Result: 60% train, 20% val, 20% test of the original dataset
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

print(f"Train: {X_train.shape[0]} samples")       # → 341
print(f"Validation: {X_val.shape[0]} samples")    # → 114
print(f"Test: {X_test.shape[0]} samples")         # → 114

# Notice: stratify= ensures class proportions are maintained in every split
# This is crucial for imbalanced classification datasets
```
Why stratify=y? Without stratification, a random split might place nearly all examples of a rare class into the training set, leaving the validation set with almost none. Stratified splitting preserves the original class distribution in every subset, which is essential for reliable evaluation on imbalanced data.
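You can see the effect directly on a synthetic, deliberately imbalanced dataset (the data here is random stand-in material, not from the chapter's running example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 1000 samples, only 3% positive class
rng = np.random.RandomState(0)
X = rng.randn(1000, 4)
y = np.array([1] * 30 + [0] * 970)

# Same random seed, with and without stratification
_, _, _, y_val_plain = train_test_split(X, y, test_size=0.2, random_state=7)
_, _, _, y_val_strat = train_test_split(X, y, test_size=0.2, random_state=7, stratify=y)

# The plain split's positive count varies by luck of the draw;
# the stratified split always lands on exactly 3% of 200 = 6 positives
print(f"Positives in plain split:      {y_val_plain.sum()}")
print(f"Positives in stratified split: {y_val_strat.sum()}")  # → 6
```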
Cross-Validation: A More Robust Estimate
A single train/validation split has a serious weakness: the result depends heavily on which data points happened to land in which split — a matter of luck. With a small dataset, this randomness can cause wild swings in the reported performance. Cross-validation solves this by repeating the evaluation multiple times with different splits and averaging the results.
K-Fold Cross-Validation (K = 5)
Final CV score = Mean(Score 1, Score 2, Score 3, Score 4, Score 5)
In 5-fold cross-validation, the dataset is split into 5 equal parts. The model is trained 5 times, each time using 4 folds for training and the remaining fold as the validation set. Every data point gets used for validation exactly once. The final performance estimate is the average across all 5 folds, which is far more reliable than a single split.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

# Stratified K-Fold preserves class balance in every fold
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# cross_val_score trains and evaluates the model K times automatically
scores = cross_val_score(
    model, X, y,
    cv=cv_strategy,
    scoring='accuracy'
)

print("Individual fold scores:", np.round(scores, 4))
# → [0.9649 0.9737 0.9561 0.9649 0.9561]

print(f"Mean CV Accuracy: {scores.mean():.2%}")
# → Mean CV Accuracy: 96.31%

print(f"Std Dev: {scores.std():.4f}")
# → Std Dev: 0.0065 (small = consistent, not lucky)
```
How to read CV results: A mean accuracy of 96.31% with a standard deviation of 0.65% tells you the model is consistently good across different data subsets — not just lucky on one particular split. A high standard deviation (e.g., 10%) would be a red flag indicating unstable performance.
Types of Cross-Validation
K-fold is the most common form, but different problem types call for different CV strategies. Choosing the wrong one can produce misleading scores.
| Technique | How It Works | Best Used When | Drawback |
|---|---|---|---|
| K-Fold CV | Data split into K equal folds; model trained K times, each fold used as validation once | Large datasets, balanced classes, general use | Can produce unbalanced folds with imbalanced classes |
| Stratified K-Fold | Same as K-Fold but each fold preserves the original class proportion | Classification with imbalanced classes — default recommendation | Only applicable to classification tasks |
| Leave-One-Out (LOOCV) | K = N (number of samples); every single point is used as a validation set once | Very small datasets where every sample matters | Extremely slow on large datasets; high variance in score estimate |
| Repeated K-Fold | K-Fold repeated R times with different random shuffles; R x K total scores are averaged | Small-to-medium datasets requiring very stable estimates | Computationally expensive (e.g., 5x10 = 50 model fits) |
| Time Series Split | Training window grows forward in time; validation is always in the future relative to training | Time series data where future cannot be used to predict the past | Early folds have very little training data |
| Group K-Fold | Ensures all data from the same group (e.g., same patient) stays in the same fold | Medical, user behaviour data where grouped samples must not be split | Fold sizes can be unequal depending on group sizes |
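The Group K-Fold row deserves a concrete illustration, since it is the strategy most often forgotten. A minimal sketch with hypothetical patient IDs (the data here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 samples from 4 patients (3 samples each)
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
patient_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=patient_ids), 1):
    val_patients = set(patient_ids[val_idx])
    train_patients = set(patient_ids[train_idx])
    # No patient ever appears in both training and validation
    assert val_patients.isdisjoint(train_patients)
    print(f"Fold {fold}: validate on patient(s) {sorted(val_patients)}")
```

With ordinary K-Fold, two scans from the same patient could land on opposite sides of the split, letting the model "recognise the patient" rather than learn the pathology.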
Time Series Cross-Validation (walk-forward validation):
```python
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

# Synthetic stand-in data: 600 time-ordered observations, 5 features
# (substitute your own time series here)
rng = np.random.RandomState(42)
X = rng.randn(600, 5)
y = rng.randn(600)

# TimeSeriesSplit never allows future data to "leak" into training
# Each split: training = all past data, validation = next window
tscv = TimeSeriesSplit(n_splits=5)

# Visualise the splits
for fold, (train_idx, val_idx) in enumerate(tscv.split(X), 1):
    print(f"Fold {fold}: Train indices {train_idx[0]}-{train_idx[-1]}"
          f" Validate on {val_idx[0]}-{val_idx[-1]}")
# Fold 1: Train 0-99  → Validate 100-199
# Fold 2: Train 0-199 → Validate 200-299
# Fold 3: Train 0-299 → Validate 300-399
# Fold 4: Train 0-399 → Validate 400-499
# Fold 5: Train 0-499 → Validate 500-599

# Run cross-validation with the time-aware strategy
model = GradientBoostingRegressor()
scores = cross_val_score(model, X, y, cv=tscv, scoring='neg_mean_absolute_error')
```
Nested Cross-Validation: Tuning and Evaluation Together
A subtle and dangerous bias arises when you use the same cross-validation loop to both select hyperparameters and report the final performance. If you search 100 hyperparameter combinations and pick the best one using CV, you are effectively "overfitting to the validation set." The reported score will be optimistic.
Nested cross-validation uses two independent loops: an outer loop for unbiased performance estimation and an inner loop for hyperparameter search. This is the gold standard when both tuning and evaluation must happen on the same dataset.
Nested Cross-Validation Structure
```python
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

# Define inner loop: hyperparameter search space
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto'], 'kernel': ['rbf', 'linear']}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# GridSearchCV = inner loop (finds best hyperparams for each outer fold)
inner_search = GridSearchCV(
    SVC(), param_grid, cv=inner_cv, scoring='accuracy'
)

# cross_val_score = outer loop (provides unbiased performance estimate)
nested_scores = cross_val_score(
    inner_search, X, y, cv=outer_cv, scoring='accuracy'
)

print(f"Nested CV scores: {np.round(nested_scores, 4)}")
print(f"Mean (unbiased): {nested_scores.mean():.2%}")
# → Mean (unbiased): 97.36%
```
The Bias-Variance Tradeoff
Every model's prediction error can be mathematically decomposed into three parts: bias, variance, and irreducible noise. Understanding this decomposition is what separates engineers who debug models systematically from those who guess randomly.
| Component | Definition | Symptom | Fix |
|---|---|---|---|
| Bias | Error from the model's wrong assumptions about the data. A biased model consistently misses the target in the same direction. | High training error AND high test error | Use a more complex model, add more features, reduce regularisation |
| Variance | Error from the model's excessive sensitivity to the training data. A high-variance model is unstable — different training sets produce very different models. | Low training error, much higher test error (large generalisation gap) | Use more training data, use a simpler model, add regularisation, use ensemble methods |
| Irreducible Noise | The inherent randomness in the data itself — measurement errors, unmeasured confounders. No model can eliminate this. | Error floor that cannot be crossed regardless of model complexity | Improve data collection, better sensors, remove noise at source |
Total Expected Error Decomposition
Error(x) = Bias² + Variance + Irreducible Noise
As model complexity increases: Bias decreases, Variance increases. The optimal point minimises the sum of both.
Diagnosing with Learning Curves
A learning curve plots the training score and validation score as a function of the number of training samples. It is the most powerful single tool for diagnosing whether your model is suffering from high bias, high variance, or is well-calibrated.
High bias (underfitting): Both training and validation scores converge to a low value. Adding more data does not help because the model is too simple.
What to do: Increase model complexity, add features, reduce regularisation, try a different algorithm.
High variance (overfitting): Training score is high but validation score is significantly lower. A large gap persists even as more data is added, though it narrows.
What to do: Add more training data, simplify model, increase regularisation, use dropout (neural networks), use ensembles (bagging).
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Compute scores at increasing training set sizes
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    n_jobs=-1  # use all CPU cores
)

# Mean and standard deviation across folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_std = val_scores.std(axis=1)

# Plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(train_sizes, train_mean, 'b-', label='Training score')
ax.plot(train_sizes, val_mean, 'g-', label='Validation score')
ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
ax.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='green')
ax.set_xlabel('Training Set Size'); ax.set_ylabel('Accuracy')
ax.legend(); plt.tight_layout(); plt.show()
```
Choosing the Right Evaluation Metric
Accuracy alone is one of the most dangerous metrics to rely on. On a dataset where 99% of emails are legitimate, a classifier that labels everything as "not spam" achieves 99% accuracy while being completely useless. The right metric depends on the task type and the cost of different errors.
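The spam thought experiment above is easy to reproduce with scikit-learn's `DummyClassifier` on a synthetic imbalanced dataset (the data here is random stand-in material):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical spam dataset: 1% spam (class 1), 99% legitimate (class 0)
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = np.array([1] * 10 + [0] * 990)

# A "classifier" that always predicts the majority class
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2%}")  # → 99.00% — looks great!
print(f"Recall:   {recall_score(y, y_pred):.2%}")    # → 0.00% — catches zero spam
```

A 99% accuracy next to 0% recall is exactly the failure mode the table below is designed to help you avoid.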
| Task Type | Metric | When to Use It | Beware When |
|---|---|---|---|
| Classification | Accuracy | Classes are balanced and all errors are equally costly | Classes are imbalanced — will be misleadingly high |
| Classification | Precision | False positives are costly (e.g., spam filter blocking legitimate emails) | You also care about missing true positives (recall) |
| Classification | Recall (Sensitivity) | False negatives are costly (e.g., cancer screening missing a tumour) | You also care about false alarms (precision) |
| Classification | F1 Score | Imbalanced classes; both precision and recall matter equally | Asymmetric cost — use F-beta instead to weight precision vs recall |
| Classification | ROC-AUC | Comparing classifiers across all classification thresholds | Severely imbalanced datasets — use Precision-Recall AUC instead |
| Regression | MAE | Outliers exist; you want an interpretable average error in original units | You want to heavily penalise large errors |
| Regression | RMSE | Large errors should be penalised more than small ones | Your dataset has many outliers — they will dominate RMSE |
| Regression | R-squared | Communicating "how much variance the model explains" to non-technical audiences | Comparing models trained on different datasets — R² is not comparable across datasets |
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report,
    confusion_matrix, ConfusionMatrixDisplay
)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()  # keep the full Bunch so target_names is available below
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]  # class 1 probabilities for ROC

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")     # → 0.9649
print(f"Precision: {precision_score(y_test, y_pred):.4f}")    # → 0.9714
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")       # → 0.9714
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")           # → 0.9714
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_prob):.4f}") # → 0.9944

# Full report: per-class breakdown
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Confusion matrix (from_predictions both computes and draws the plot)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
```
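The table above recommends the F-beta score when precision and recall carry asymmetric costs. A minimal sketch on toy labels (the predictions here are illustrative only) shows how `beta` shifts the balance:

```python
from sklearn.metrics import fbeta_score

# Toy ground truth and predictions: precision = 2/3, recall = 2/4 = 0.5
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# beta < 1 favours precision; beta > 1 weights recall more heavily
f_half = fbeta_score(y_true, y_pred, beta=0.5)
f_one = fbeta_score(y_true, y_pred, beta=1.0)   # identical to the F1 score
f_two = fbeta_score(y_true, y_pred, beta=2.0)

print(f"F0.5: {f_half:.3f}")  # → 0.625 (leans toward precision, the higher quantity)
print(f"F1:   {f_one:.3f}")   # → 0.571
print(f"F2:   {f_two:.3f}")   # → 0.526 (leans toward recall, the lower quantity)
```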
Data Leakage: The Silent Killer of ML Projects
Data leakage is the single most dangerous mistake in machine learning evaluation. It occurs when information from outside the training dataset is inadvertently used to build the model, resulting in an overly optimistic performance estimate that completely falls apart in production.
The Correct Pattern — Using a Pipeline to Prevent Leakage:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# WRONG — leaky approach
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # scaled on ALL data including future test folds
# cross_val_score(SVC(), X_scaled, y, cv=5)  ← this leaks test set info!

# CORRECT — leak-free approach using Pipeline
# Pipeline ensures the scaler is fit ONLY on the training fold in each CV iteration
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # fit on train fold, transform both
    ('svm', SVC(kernel='rbf'))     # trained on scaled train fold
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')

print(f"CV Accuracy (leak-free): {scores.mean():.2%} +/- {scores.std():.4f}")
# → CV Accuracy (leak-free): 97.89% +/- 0.0088

# sklearn Pipeline is the recommended pattern for all preprocessing + modelling
```
The Complete Testing and Validation Workflow
Combining everything covered in this chapter, here is the rigorous, professional workflow every ML practitioner should follow — from raw data to a final, trustworthy model evaluation.
Complete Model Validation Pipeline
(Diagram: split off the test set → build a leak-free pipeline → cross-validate and tune hyperparameters on train+validation → retrain the best model on the full training set → evaluate once on the test set.)
Putting It All Together: A Complete End-to-End Example
The following example demonstrates the full professional validation workflow in a single, runnable script: stratified split, pipeline construction, cross-validation, hyperparameter tuning with GridSearchCV, and a single honest final evaluation on the locked test set.
```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, roc_auc_score

# ── STEP 1: Load data and lock away the test set ─────────────
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
# X_test, y_test are now locked. DO NOT TOUCH until Step 5.

# ── STEP 2: Build a leak-free Pipeline ───────────────────────
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(probability=True))
])

# ── STEP 3 & 4: Hyperparameter tuning via inner CV ───────────
param_grid = {
    'clf__C': [0.1, 1, 10, 100],
    'clf__kernel': ['rbf', 'linear'],
    'clf__gamma': ['scale', 'auto']
}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    pipeline, param_grid,
    cv=inner_cv, scoring='roc_auc', n_jobs=-1, refit=True
)

# Fit on training data only (refit=True automatically retrains on full train set)
grid_search.fit(X_train, y_train)

print("Best hyperparameters:", grid_search.best_params_)
print(f"Best CV ROC-AUC: {grid_search.best_score_:.4f}")

# ── STEP 5: Final evaluation on the locked test set ──────────
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_pred_prob = best_model.predict_proba(X_test)[:, 1]

print("\n=== FINAL TEST SET RESULTS ===")
print(classification_report(y_test, y_pred))
print(f"Final Test ROC-AUC: {roc_auc_score(y_test, y_pred_prob):.4f}")
# → Final Test ROC-AUC: 0.9979
# This is your honest, publishable result.
```
Validation Checklist Before Reporting Results
Before you share any model performance figure — with a manager, a stakeholder, in a paper, or in production — run through this checklist.
| Check | Question to Ask | Risk if Skipped |
|---|---|---|
| Split Integrity | Was the test set truly held out and never used during development? | Optimistically biased results; model fails in production |
| Stratification | Are class proportions preserved in all splits for classification tasks? | Misleading scores on imbalanced datasets |
| Pipeline Used | Is all preprocessing inside a Pipeline that is fit only on training folds? | Data leakage; scores are invalid and overoptimistic |
| Correct Metric | Is the evaluation metric aligned with the business objective? | Model optimised for the wrong thing (e.g., accuracy on imbalanced data) |
| Multiple Folds | Are results averaged over multiple folds rather than a single split? | High variance in reported score; lucky or unlucky split |
| Baseline Comparison | Is the model compared against a simple baseline (e.g., majority class predictor)? | Model may not actually be better than a trivial rule |
| Temporal Order | For time series data, is training always in the past relative to validation? | Look-ahead bias; completely invalid scores |
| No Duplicates | Are duplicate or near-duplicate rows removed before splitting? | Same sample in both train and test; evaluation is meaningless |
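The "No Duplicates" check is worth automating. A minimal sketch of deduplicating feature rows before any split, using plain NumPy on a tiny hypothetical matrix:

```python
import numpy as np

# Hypothetical feature matrix with an exact duplicate row (rows 0 and 3)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [1.0, 2.0]])
y = np.array([0, 1, 0, 0])

# Deduplicate BEFORE any train/test split, keyed on the feature rows
_, unique_idx = np.unique(X, axis=0, return_index=True)
unique_idx = np.sort(unique_idx)  # restore the original sample ordering

X_clean, y_clean = X[unique_idx], y[unique_idx]
print(X_clean.shape)  # → (3, 2): the duplicate of row 0 is gone
```

With real tabular data you would typically also check for near-duplicates (e.g. rows identical after rounding), which exact matching like this will miss.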
Key Takeaways
- The test set must be locked away immediately and used only once — at the very end — to get an unbiased final performance estimate.
- K-fold cross-validation gives a much more reliable performance estimate than a single train/validation split, especially on smaller datasets.
- Always use Stratified K-Fold for classification tasks to ensure class proportions are preserved across all folds.
- Data leakage — especially fitting preprocessing on the full dataset before splitting — silently inflates your scores and destroys trust in your model's reported performance.
- Wrap all preprocessing and modelling steps in an sklearn Pipeline to guarantee a leak-free cross-validation loop automatically.
- Choose your evaluation metric based on the cost of different errors — accuracy alone is almost never sufficient for imbalanced or high-stakes problems.
- The bias-variance tradeoff is the lens through which you diagnose model behaviour: high training error means underfitting (high bias); high gap between training and test error means overfitting (high variance).
What's Next?
In Chapter 1.5 — The Machine Learning Project Lifecycle, we will zoom out from individual models and metrics to map the complete end-to-end workflow of a real-world ML project — from problem definition and data collection through model deployment, monitoring, and iteration — with a practical framework you can apply immediately.