1.5 The Machine Learning Project Lifecycle

A machine learning model is never just a model. Behind every production system is a carefully orchestrated sequence of decisions, experiments, and engineering steps — from the moment a business problem is identified to the point where a model is actively running in production and being watched around the clock. That sequence is the Machine Learning Project Lifecycle.

Why a Structured Lifecycle Matters

Most beginner ML tutorials focus exclusively on the modelling step — loading a dataset, training a model, and printing an accuracy score. In reality, the modelling step accounts for roughly 10 to 20 percent of the total effort in a professional ML project. The remaining 80 to 90 percent is everything else: understanding the problem, gathering clean data, engineering features, evaluating rigorously, deploying safely, and monitoring behaviour over time.

Skipping or rushing any phase causes compounding problems downstream. A poorly framed problem leads to a model that solves the wrong thing. Poor data quality leads to a model that learns noise. A model deployed without monitoring can silently degrade and cause real business harm before anyone notices. A structured lifecycle exists precisely to prevent these failures.

Industry finding: According to Gartner, roughly 85 percent of AI and ML projects fail to move from prototype to production. The primary reasons are not algorithmic — they are process failures: misaligned objectives, data quality issues, and the absence of monitoring strategies.

The Eight Phases at a Glance

The lifecycle of a machine learning project can be divided into eight distinct phases. Each phase has a clear purpose, a set of inputs, and a defined output that feeds into the next phase.

1. Problem Definition and Goal Framing: translate a business objective into a concrete ML task with measurable success criteria, defined scope, and assessed feasibility.
   Deliverable: Problem statement document and success metrics
2. Data Collection and Acquisition: identify, gather, and consolidate all relevant data from internal databases, APIs, public datasets, or web sources.
   Deliverable: Raw dataset stored in a versioned repository
3. Exploratory Data Analysis (EDA): understand the structure, quality, distributions, correlations, and anomalies in the data before any modelling begins.
   Deliverable: EDA notebook with visualisations and data quality report
4. Data Preprocessing and Feature Engineering: clean the data, handle missing values, encode categoricals, scale numerics, and construct informative new features.
   Deliverable: Cleaned, feature-rich dataset and preprocessing pipeline
5. Model Selection and Training: select candidate algorithms, establish a baseline, train models, and tune hyperparameters using systematic search strategies.
   Deliverable: Trained model artefact with documented hyperparameters
6. Model Evaluation and Validation: rigorously assess model performance on held-out data using appropriate metrics, cross-validation, and error analysis.
   Deliverable: Evaluation report with generalisation estimates and residual analysis
7. Model Deployment: package the trained model as a service (REST API, batch job, or embedded system) and release it into a staging or production environment.
   Deliverable: Production-ready model endpoint with versioning and rollback capability
8. Monitoring and Maintenance: continuously track model performance, detect data drift and concept drift, trigger retraining pipelines, and iterate on improvements.
   Deliverable: Monitoring dashboard and retraining schedule

Phase 1 — Problem Definition and Goal Framing

Before writing a single line of code or collecting a single row of data, the most important question must be answered: what exactly are we trying to solve, and is machine learning the right tool? Problem definition is the most undervalued phase in the lifecycle, yet it is the one that determines the success of everything that follows.

A business stakeholder might say: "We want to reduce customer churn." That is a goal, not an ML problem. The ML practitioner must translate it: "Given a customer's last 90 days of behaviour — logins, purchases, support tickets, and session duration — predict whether they will cancel their subscription within the next 30 days, achieving at least 80 percent recall on the churning class." This level of specificity defines the task type, the input features, the prediction horizon, and the acceptable performance threshold.

Questions to Answer at This Phase
  • What is the business objective and how will solving this ML problem impact it?
  • Is this a regression, classification, clustering, or reinforcement learning problem?
  • What is the prediction target and at what time horizon?
  • What performance metric aligns with the business goal (accuracy, recall, AUC, RMSE)?
  • What is the minimum acceptable performance threshold for production deployment?
  • What data is available and is there enough of it?
  • Are there legal, privacy, or ethical constraints on the data or predictions?
  • What does the current non-ML solution look like, and what is the baseline to beat?

Critical distinction: Always define your success metric before you train any model. Choosing metrics after seeing results introduces selection bias and almost always leads to an over-optimistic view of model quality. Business stakeholders and the ML team must agree on the metric together.
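The checklist above can be captured as a lightweight, reviewable artefact. The sketch below records the churn example's framing as data (the field names are illustrative, not a standard schema), so the agreed metric and threshold are explicit before any modelling begins:

```python
# A hypothetical problem-statement record for the churn example above.
# Writing the framing down as data makes the agreed metric and threshold
# explicit and reviewable before any code is written.
problem_spec = {
    "business_goal":   "Reduce customer churn",
    "ml_task":         "binary classification",
    "target":          "cancels_within_30d",
    "features_window": "last 90 days of behaviour",
    "primary_metric":  "recall on the churning class",
    "min_threshold":   0.80,
    "baseline":        "current rule-based retention list",
}

assert 0.0 < problem_spec["min_threshold"] <= 1.0
print(f"Task: {problem_spec['ml_task']} | "
      f"target metric: {problem_spec['primary_metric']} >= {problem_spec['min_threshold']}")
```

A record like this is also the natural place to capture the stakeholders' written sign-off on the metric.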

Phase 2 — Data Collection and Acquisition

Machine learning models are only as good as the data they are trained on. Phase 2 involves identifying all relevant data sources, understanding their schemas and update frequencies, and consolidating them into a single working dataset. Data can come from internal databases, third-party APIs, public repositories, web scraping, sensor streams, or data labelling exercises.

During collection, it is equally important to document the provenance of every data source — what it contains, when it was collected, how it was sampled, and what biases might exist. A model trained on historically biased data will reproduce and amplify that bias at inference time. Version your raw data immediately; raw data should never be modified in place.
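As a minimal sketch of provenance documentation (the helper and the `.provenance.json` naming are assumptions, not a standard tool), a checksum plus a metadata file stored next to each raw extract makes later audits possible:

```python
import datetime
import hashlib
import json
import pathlib

def record_provenance(path: str, source: str, notes: str) -> dict:
    """Compute a checksum and store provenance metadata next to a raw data file."""
    data = pathlib.Path(path).read_bytes()
    meta = {
        "file":         path,
        "source":       source,
        "sha256":       hashlib.sha256(data).hexdigest(),
        "collected_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "notes":        notes,
    }
    pathlib.Path(path + ".provenance.json").write_text(json.dumps(meta, indent=2))
    return meta

# Hypothetical usage for the churn dataset:
# record_provenance('customer_transactions.csv',
#                   source='billing DB export',
#                   notes='all active customers as of export date')
```

The checksum also doubles as a tamper check: if the raw file is ever modified in place, the recorded hash no longer matches.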

Python — Loading Data from Multiple Sources
import pandas as pd
import requests
from sqlalchemy import create_engine

# ── Source 1: Load from a CSV file
df_csv = pd.read_csv('customer_transactions.csv', parse_dates=['transaction_date'])
print(f"CSV rows: {df_csv.shape[0]} | Memory: {df_csv.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# ── Source 2: Load from a REST API
response = requests.get(
    'https://api.example.com/v1/churn-labels',
    headers={'Authorization': 'Bearer YOUR_TOKEN'}
)
response.raise_for_status()               # raises if the request failed
df_api = pd.DataFrame(response.json()['records'])

# ── Source 3: Load from a SQL database
engine = create_engine('postgresql://user:password@host:5432/mydb')
query = """
    SELECT customer_id, plan_type, signup_date,
           support_tickets_90d, avg_session_minutes
    FROM customers
    WHERE signup_date >= '2022-01-01'
"""
df_sql = pd.read_sql(query, engine)

# ── Merge sources on the common key
df = df_csv.merge(df_api, on='customer_id', how='left')
df = df.merge(df_sql, on='customer_id', how='left')
print(f"Final dataset: {df.shape}")  # → Final dataset: (45320, 18)

Data Volume Rule of Thumb: For tabular classification or regression, aim for at least 10 times as many examples as features. For deep learning models, you typically need hundreds of thousands to millions of examples. If you do not have enough data, consider transfer learning, data augmentation, or synthetic data generation before proceeding.
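The 10x rule of thumb is easy to check mechanically. A hedged sketch (the helper name is illustrative):

```python
def enough_data(n_samples: int, n_features: int, factor: int = 10) -> bool:
    """Rule-of-thumb check: at least `factor` examples per feature."""
    return n_samples >= factor * n_features

# The merged churn dataset (45,320 rows, 18 features) comfortably clears the bar:
print(enough_data(45_320, 18))   # → True
# 500 rows with 80 features does not — consider augmentation or fewer features:
print(enough_data(500, 80))      # → False
```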

Phase 3 — Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of becoming deeply familiar with your dataset before making any modelling decisions. The goal is to understand what the data contains, what is missing, what distributions the features follow, how features relate to one another, and whether there are any anomalies, outliers, or data quality issues that must be addressed.

EDA is not a box to tick — it is an open-ended investigation. Good EDA reveals the structure that should guide your feature engineering choices, model selection, and the appropriate evaluation strategy. For example, discovering severe class imbalance during EDA informs the decision to use stratified splits and choose precision-recall AUC over accuracy.

Python — Systematic EDA Workflow
import pandas as pd
import numpy as np

# ── 1. Structural overview
print("Shape:", df.shape)               # rows, columns
print(df.dtypes.value_counts())         # how many numeric vs object columns
print(df.head(3))

# ── 2. Missing value audit
missing = df.isnull().sum()
missing_report = pd.DataFrame({
    'Count':   missing,
    'Percent': (missing / len(df) * 100).round(2)
}).sort_values('Percent', ascending=False)
print(missing_report[missing_report['Count'] > 0])

# ── 3. Target variable distribution
print(df['churned'].value_counts(normalize=True).round(3))
# churned  0: 0.932   1: 0.068  → severe class imbalance!

# ── 4. Correlation with target (numerical features)
num_df = df.select_dtypes(include=[np.number])
correlations = num_df.corr()['churned'].sort_values(key=abs, ascending=False)
print(correlations.drop('churned').head(8))

# ── 5. Duplicate check
n_dupes = df.duplicated().sum()
print(f"Duplicate rows: {n_dupes} ({n_dupes / len(df) * 100:.2f}%)")

# ── 6. Statistical summary
print(df.describe(percentiles=[.01, .25, .50, .75, .99]))

What to look for in EDA: Skewed distributions that require log transforms. Missing value patterns (random vs systematic). Outliers that could be genuine extreme values or data entry errors. Class imbalance in the target. Multicollinearity between features. Temporal patterns or seasonality in time-based data. Any of these findings directly inform the next phase.
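For instance, skew candidates can be flagged programmatically. A small sketch (the 1.0 cutoff is a common heuristic, not a hard rule, and the toy columns are synthetic):

```python
import numpy as np
import pandas as pd

def flag_skewed(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return numeric columns whose absolute skewness exceeds `threshold` —
    candidates for a log (or similar) transform before modelling."""
    skew = df.select_dtypes(include=[np.number]).skew()
    return skew[skew.abs() > threshold].sort_values(ascending=False)

# Demo with one symmetric and one artificially right-skewed column:
toy = pd.DataFrame({
    'symmetric': np.random.default_rng(0).normal(size=1_000),
    'skewed':    np.random.default_rng(0).lognormal(size=1_000),
})
print(flag_skewed(toy))   # only the 'skewed' column should appear
```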

Phase 4 — Data Preprocessing and Feature Engineering

Raw data is almost never ready for a model. Phase 4 transforms the raw dataset into a clean, structured, model-ready form. This involves two related but distinct activities. Data preprocessing is about fixing problems: filling missing values, encoding categorical variables, scaling numerical features, and removing duplicates. Feature engineering is about creation: building new, more informative features from existing ones to help the model capture domain knowledge it could not discover on its own.

The most important engineering practice here is to build preprocessing as a scikit-learn Pipeline. A pipeline packages the entire transformation sequence into a single object that can be fitted on training data and applied identically to validation, test, and future production data — eliminating the risk of data leakage and inconsistency.

Python — Preprocessing Pipeline with ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import numpy as np

# ── Define feature groups
numerical_features   = ['age', 'avg_session_minutes', 'support_tickets_90d']
categorical_features = ['plan_type', 'city', 'payment_method']

# ── Numerical pipeline: impute median → scale to unit variance
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

# ── Categorical pipeline: impute mode → one-hot encode
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# ── Combine into a single preprocessor
preprocessor = ColumnTransformer([
    ('num', num_pipeline, numerical_features),
    ('cat', cat_pipeline, categorical_features)
])

# ── Feature engineering: add derived feature before the pipeline
df['tickets_per_session'] = (
    df['support_tickets_90d'] / (df['avg_session_minutes'] + 1)
)
# The pipeline handles all transformations consistently at fit and predict time

Data leakage warning: Always fit your preprocessing pipeline on the training set only, then apply it to validation and test sets. Fitting on the full dataset before splitting leaks test information into the model, producing artificially inflated evaluation scores that will not reflect real-world performance.
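The warning above, in code form: the split-then-fit ordering is the whole defence (synthetic data used for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(42).normal(loc=50, scale=10, size=(1_000, 3))

# ── Correct: split FIRST, then fit the transformer on training data only
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)        # statistics from training rows only
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)          # test set scaled with TRAIN statistics

# ── Leaky (do NOT do this): fitting on all rows lets test-set means and
#    variances influence the transform applied to the training data
# scaler_leaky = StandardScaler().fit(X)
```

A Pipeline enforces this ordering automatically when used inside `cross_val_score` or `fit`/`predict`, which is exactly why it is the standard defence.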

Phase 5 — Model Selection and Training

With a clean, preprocessed dataset, model training can begin. The most important rule at this phase is: always start with the simplest model that could possibly work. A logistic regression or linear regression is your baseline. If a complex model does not substantially outperform the baseline, the added complexity is not justified and will make the system harder to maintain, explain, and debug.

After establishing a baseline, progress to more expressive models and use cross-validation to compare them fairly. The choice of algorithm depends on the problem type, dataset size, interpretability requirements, and available compute.

  • Binary Classification: Logistic Regression (baseline), then Gradient Boosting. XGBoost or LightGBM often performs best on tabular data for this task type.
  • Regression: Linear Regression (baseline), then Random Forest or GBM. Consider Ridge/Lasso when features outnumber samples or multicollinearity exists.
  • Multi-Class Classification: Softmax Regression, then Random Forest or CatBoost. One-vs-Rest decomposition works well with many classical algorithms.
  • Clustering: K-Means (baseline), then DBSCAN or GMM. Use the Elbow method and Silhouette score to select the number of clusters.
  • Image / Vision: pre-trained CNN (transfer learning). ResNet50 or EfficientNet-B0 with fine-tuning beats training from scratch in most cases.
  • Text / NLP: TF-IDF + Logistic Regression (baseline), then BERT. Transformer-based models deliver the best accuracy; the classical baseline is fast and interpretable.
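As an illustration of the clustering note, the silhouette score can drive the choice of k. A sketch on synthetic, well-separated blobs (the centres are chosen for the demo, not taken from any real dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters
X, _ = make_blobs(n_samples=600,
                  centers=[[0, 0], [8, 8], [0, 8], [8, 0]],
                  cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}  silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")   # → 4 for these well-separated blobs
```

On real data the peak is rarely this clean; the silhouette curve is a guide to combine with domain knowledge, not an oracle.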
Python — Baseline to Advanced Model Comparison
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Candidate models — note: the preprocessor is shared across all
candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest':       RandomForestClassifier(n_estimators=200, random_state=42),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=300, random_state=42),
}

for name, clf in candidates.items():
    pipe = Pipeline([
        ('prep',  preprocessor),      # from Phase 4
        ('model', clf)
    ])
    scores = cross_val_score(pipe, X_train, y_train,
                             cv=5, scoring='roc_auc', n_jobs=-1)
    print(f"{name:25s} ROC-AUC: {scores.mean():.4f} +/- {scores.std():.4f}")

# → Logistic Regression       ROC-AUC: 0.8124 +/- 0.0097
# → Random Forest             ROC-AUC: 0.8791 +/- 0.0062
# → Gradient Boosting         ROC-AUC: 0.9043 +/- 0.0055

Phase 6 — Model Evaluation and Validation

Training accuracy is not the same as real-world accuracy. Phase 6 rigorously assesses whether the model generalises beyond the data it was trained on. The held-out test set — which must never be touched during training or hyperparameter tuning — gives the final unbiased estimate of performance. Cross-validation on the training set gives robust interim estimates during development.

The choice of metric is critical and must match the business objective defined in Phase 1. A fraud detection system where missing a fraud is catastrophic should optimise for recall, not accuracy. A recommendation system may prioritise Precision@k. A regression model for demand forecasting may care about MAE more than RMSE, because MAE is more interpretable to the business.
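A tiny numeric example makes the MAE-versus-RMSE contrast concrete: one large miss moves RMSE far more than MAE, because squaring weights outliers heavily (the numbers below are invented for illustration).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100, 110,  95, 105, 100])
y_good = np.array([102, 108,  97, 103, 101])          # small errors everywhere
y_out  = np.array([100, 110,  95, 105, 140])          # one large miss

for name, y_pred in [('small errors', y_good), ('one outlier', y_out)]:
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"{name:13s} MAE={mae:.2f}  RMSE={rmse:.2f}")

# → small errors  MAE=1.80  RMSE=1.84
# → one outlier   MAE=8.00  RMSE=17.89
```

If the business cares about typical per-unit error, report MAE; if large individual misses are disproportionately costly, RMSE is the better fit.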

Python — Stratified Cross-Validation and Final Test Evaluation
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# ── Stratified 5-fold: preserves class ratio in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    best_pipeline, X_train, y_train, cv=cv,
    scoring=['roc_auc', 'f1', 'precision', 'recall'],
    return_train_score=True
)

print("Metric        | Train  | Val    | Gap (overfit risk)")
print("-" * 55)
for m in ['roc_auc', 'f1', 'precision', 'recall']:
    tr = results[f'train_{m}'].mean()
    vl = results[f'test_{m}'].mean()
    print(f"{m:13s} | {tr:.4f} | {vl:.4f} | {tr-vl:.4f}")

# ── Final evaluation on the held-out test set (done only once!)
best_pipeline.fit(X_train, y_train)
y_pred = best_pipeline.predict(X_test)
y_prob = best_pipeline.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churn']))
print(f"Test ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
# → Test ROC-AUC: 0.9031

The cardinal rule of evaluation: The test set is touched exactly once — at the very end, to report the final number. If you evaluate on the test set, dislike the result, adjust your model, and evaluate again, you have effectively trained on the test set. This is a form of overfitting to the test data and will produce results that do not hold up in production.

Phase 7 — Model Deployment

A model that performs well in a Jupyter notebook but is never deployed provides zero business value. Deployment is the process of packaging a trained model and making it accessible to other systems or end users. The most common deployment pattern for ML models is a REST API — an HTTP service that accepts input data and returns predictions. FastAPI is a widely used choice for this in Python due to its speed, automatic documentation, and type safety.

Before serving a model in production, it must be saved (serialised) to disk using Joblib or Pickle. The serialised file includes both the trained preprocessing pipeline and the model, so every incoming request is transformed identically to how training data was processed. The deployment artefact should be versioned so rollback is always possible if a new model underperforms.

Python — Save Model + Serve via FastAPI
# ── save_model.py: serialise the trained pipeline
import joblib

joblib.dump(best_pipeline, 'churn_model_v1.pkl')
print("Model saved successfully.")

# ─────────────────────────────────────────────────────────────────
# ── api.py: serve predictions via FastAPI
from fastapi import FastAPI
from pydantic import BaseModel
import pandas as pd, joblib

app   = FastAPI(title="Churn Predictor", version="1.0")
model = joblib.load('churn_model_v1.pkl')

class CustomerFeatures(BaseModel):
    age:                 float
    avg_session_minutes: float
    support_tickets_90d: int
    plan_type:           str
    city:                str
    payment_method:      str

@app.post("/predict")
def predict(customer: CustomerFeatures):
    df    = pd.DataFrame([customer.dict()])
    prob  = float(model.predict_proba(df)[0, 1])
    label = "Churn Risk" if prob > 0.5 else "Retained"
    return {"prediction": label, "churn_probability": round(prob, 4), "model_version": "1.0"}

# Run with: uvicorn api:app --host 0.0.0.0 --port 8000
# Auto docs at: http://localhost:8000/docs

Deployment strategies: The three most common patterns are (1) Online inference — a REST API serving one prediction at a time with low latency, ideal for customer-facing applications. (2) Batch inference — a scheduled job scoring thousands of records overnight, ideal for offline use cases like churn scoring. (3) Edge inference — the model is deployed on-device (phone, sensor, camera) for real-time performance without network dependency.
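Pattern (2) can be sketched in a few lines. The file names below are hypothetical, and the pipeline is assumed to be the serialised artefact from Phase 7:

```python
# batch_score.py — hypothetical nightly batch-scoring job (pattern 2 above).
import joblib
import pandas as pd

def score_batch(input_csv: str, model_path: str, output_csv: str) -> pd.DataFrame:
    """Load a serialised pipeline, score every row in a CSV extract,
    and write the scored records back out for downstream consumers."""
    model = joblib.load(model_path)
    batch = pd.read_csv(input_csv)
    batch['churn_probability'] = model.predict_proba(batch)[:, 1]
    batch['model_version'] = '1.0'
    batch.to_csv(output_csv, index=False)
    return batch

# Typically scheduled via cron or an orchestrator such as Airflow:
# score_batch('customers_tonight.csv', 'churn_model_v1.pkl',
#             'churn_scores_tonight.csv')
```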

Phase 8 — Monitoring and Maintenance

Deploying a model is not the finish line — it is the start of a new responsibility. In production, the world keeps changing. The statistical properties of the incoming data change over time, the relationship between features and the target changes, and model performance gradually degrades. This degradation is called model drift, and it is inevitable without an active monitoring strategy.

There are two primary types of drift to monitor. Data drift (also called covariate shift) occurs when the distribution of the input features changes — for example, a new customer segment starts using your product and their feature values look very different from your training data. Concept drift occurs when the relationship between features and the target changes — what used to predict churn may no longer do so as customer behaviour evolves.

  • Model Performance Monitoring: track prediction metrics (accuracy, AUC, F1) over time using ground-truth labels once they become available. Set alert thresholds at a percentage drop below the baseline. Tools: MLflow, Prometheus + Grafana, Weights and Biases.
  • Data Drift Detection: compare the distribution of incoming feature values against the training distribution using statistical tests such as the Population Stability Index, Kolmogorov-Smirnov test, or Jensen-Shannon divergence. Tools: Evidently AI, Alibi Detect, NannyML.
  • Prediction Distribution Monitoring: monitor the distribution of model output scores. A sudden shift in the average predicted probability or a spike in extreme values often indicates upstream data quality issues. Tools: WhyLogs, Great Expectations.
  • Retraining Pipelines: define retraining triggers, whether time-based (every 30 days), performance-based (AUC drops below 0.85), or drift-based (PSI exceeds 0.2). Automate retraining and shadow-deploy new models before promoting them. Tools: Airflow, Prefect, Kubeflow Pipelines.
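The Population Stability Index mentioned above is simple enough to implement directly. A minimal sketch (10 quantile bins and the 0.1/0.2 reading are common conventions, not a formal standard):

```python
import numpy as np

def population_stability_index(expected, actual, n_bins: int = 10) -> float:
    """PSI between a training-time (expected) and live (actual) feature sample.
    Common reading: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant drift."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover values outside the training range
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual,   bins=edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)         # avoid log(0) on empty bins
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Synthetic demonstration: same distribution vs a mean shift of 0.8 std
rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, 10_000)
live_same     = rng.normal(0.0, 1.0, 10_000)
live_shifted  = rng.normal(0.8, 1.0, 10_000)

print(f"No drift: PSI = {population_stability_index(train_feature, live_same):.3f}")
print(f"Drifted:  PSI = {population_stability_index(train_feature, live_shifted):.3f}")
```

In practice, a job like this runs on a schedule for every monitored feature, and PSI above the agreed trigger (0.2 in the list above) opens a retraining ticket.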

The Iterative Nature of ML Projects

The eight phases described above are not a strict waterfall sequence — they are a cyclical, iterative process. In practice, insights discovered in Phase 3 (EDA) send you back to Phase 2 to collect more data. A poor evaluation result in Phase 6 sends you back to Phase 4 to engineer better features or to Phase 5 to try a different algorithm. Monitoring alerts in Phase 8 trigger a full retraining cycle starting from Phase 2. The diagram below shows the most common feedback loops.

[Figure: The ML Project Lifecycle — Key Feedback Loops. Phases 1–8 arranged in a cycle: Problem Definition → Data Collection → EDA → Preprocessing & Features → Model Training → Evaluation & Validation → Deployment → Monitoring & Retrain, with an arrow back to Phase 1 if the goal changes.]

Poor evaluation (Phase 6) loops back to feature engineering (Phase 4) or model selection (Phase 5). Monitoring alerts (Phase 8) trigger a full retraining cycle from Phase 2 or Phase 4.

Connection to CRISP-DM: The Industry Standard Framework

The lifecycle described in this chapter closely mirrors CRISP-DM (Cross-Industry Standard Process for Data Mining), a framework developed in the late 1990s that remains widely used in industry today. Understanding the correspondence between the two helps when working within organisations that use CRISP-DM terminology.

CRISP-DM (6 Phases)
  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modelling
  • Evaluation
  • Deployment
This Lifecycle (8 Phases)
  • Phase 1: Problem Definition and Goal Framing
  • Phase 2: Data Collection + Phase 3: EDA
  • Phase 4: Preprocessing and Feature Engineering
  • Phase 5: Model Selection and Training
  • Phase 6: Model Evaluation and Validation
  • Phase 7: Deployment + Phase 8: Monitoring

The main difference is that this lifecycle explicitly separates EDA from data collection, and monitoring from deployment — reflecting how modern ML engineering practices have matured since CRISP-DM was conceived. Both frameworks emphasise the iterative, non-linear nature of data projects.


Real-World Walkthrough: House Price Prediction End-to-End

To make the lifecycle concrete, here is a condensed but fully functional end-to-end pipeline for house price prediction. Each comment maps to its lifecycle phase, showing how the eight phases translate into actual code that runs as a single coherent system.

Python — End-to-End ML Pipeline (Phases 1–7 in One Script)
# ════ Phase 1: Problem Definition ════════════════════════════════
# Task: Regression — predict house sale price
# Metric: MAE (interpretable to business) and R² (variance explained)
# Baseline: median of training prices — anything below this is trivial

import pandas as pd, numpy as np, joblib
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# ════ Phase 2: Data Collection ════════════════════════════════════
df = pd.read_csv('house_prices.csv')

# ════ Phase 3: EDA (findings → decisions below) ══════════════════
# Discovered: SalePrice is right-skewed → apply log transform
# Discovered: GarageYrBlt has 81 missing → impute with 0 (no garage)
# Discovered: LotArea has extreme outliers → cap at 99th percentile

# ════ Phase 4: Preprocessing and Feature Engineering ══════════════
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)   # EDA finding: missing means no garage
df['LotArea'] = df['LotArea'].clip(upper=df['LotArea'].quantile(0.99))  # EDA finding: cap outliers
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
df['HouseAge'] = df['YrSold'] - df['YearBuilt']

X = df.drop('SalePrice', axis=1)
y = np.log1p(df['SalePrice'])   # log transform from EDA finding
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

num_cols = X.select_dtypes(include='number').columns.tolist()
cat_cols = X.select_dtypes(include='object').columns.tolist()

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imp', SimpleImputer(strategy='median')),
        ('sc',  StandardScaler())
    ]), num_cols),
    ('cat', Pipeline([
        ('imp', SimpleImputer(strategy='most_frequent')),
        ('enc', OneHotEncoder(handle_unknown='ignore'))
    ]), cat_cols),
])

# ════ Phase 5: Model Training ═════════════════════════════════════
pipeline = Pipeline([
    ('prep',  preprocessor),
    ('model', GradientBoostingRegressor(n_estimators=500,
                                        learning_rate=0.05,
                                        max_depth=4,
                                        random_state=42))
])
pipeline.fit(X_train, y_train)

# ════ Phase 6: Evaluation ════════════════════════════════════════
y_pred_log = pipeline.predict(X_test)
y_pred     = np.expm1(y_pred_log)   # reverse the log transform
y_true     = np.expm1(y_test)
print(f"MAE: ${mean_absolute_error(y_true, y_pred):,.0f}")   # → $18,243
print(f"R²:  {r2_score(y_true, y_pred):.4f}")               # → 0.9112

# ════ Phase 7: Save and Deploy ═══════════════════════════════════
joblib.dump(pipeline, 'house_price_model_v1.pkl')
print("Pipeline serialised. Ready to deploy.")

Result interpretation: An MAE of $18,243 means the model's predictions are off by an average of $18,243 per house. An R² of 0.9112 means the model explains 91.12 percent of the variance in sale prices on unseen data. Whether this is good enough depends on the business context defined in Phase 1 — if the business team set a threshold of MAE below $20,000, this model is ready for deployment.
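Phase 1 named the median of training prices as the trivial baseline, and it is worth computing explicitly so the model's MAE has a reference point. A sketch (reusing the log-transformed y_train/y_test from the script; the actual baseline value depends on the dataset):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def median_baseline_mae(y_train_log, y_test_log) -> float:
    """MAE (in dollars) of always predicting the median training price."""
    median_price  = np.median(np.expm1(y_train_log))
    baseline_pred = np.full(len(y_test_log), median_price)
    return mean_absolute_error(np.expm1(y_test_log), baseline_pred)

# In the script above:
# print(f"Baseline MAE: ${median_baseline_mae(y_train, y_test):,.0f}")
# Any candidate model must beat this number to justify its complexity.
```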


Common Pitfalls at Each Phase

Understanding where ML projects typically go wrong is as important as understanding what to do right. The table below maps the most frequent and damaging pitfalls to the phase where they occur.

  • 1 — Problem Definition (High): solving a proxy metric that does not align with the business goal (e.g., optimising accuracy when the business cares about revenue impact). How to avoid: involve business stakeholders in metric selection; document the metric and get written sign-off before modelling begins.
  • 2 — Data Collection (High): training on data that would not be available at prediction time (future leakage). How to avoid: map every feature to its real-time availability; simulate what data you have access to at the exact moment of prediction.
  • 3 — EDA (High): skipping EDA entirely and going straight to modelling. How to avoid: treat EDA as mandatory; a minimum viable EDA should always cover shape, missing values, target distribution, and key correlations.
  • 4 — Preprocessing (High): fitting the scaler or imputer on the full dataset before splitting, causing data leakage. How to avoid: always use scikit-learn Pipelines; fit transformers on training data only, then apply them to validation and test data.
  • 5 — Model Training (Medium): skipping the baseline and jumping to complex models immediately. How to avoid: a logistic regression or linear regression baseline takes five minutes to build and sets a realistic bar for improvement.
  • 6 — Evaluation (High): evaluating on the test set multiple times and selecting the model with the best test score. How to avoid: use the test set exactly once; use cross-validation on the training set for all model comparisons and hyperparameter tuning.
  • 7 — Deployment (Medium): deploying a model without versioning or without a rollback strategy. How to avoid: version every model artefact with a timestamp and performance record; always test in a staging environment before production promotion.
  • 8 — Monitoring (High): treating deployment as the end of the project and never monitoring the model. How to avoid: set up monitoring dashboards and drift detection before deployment day; define retraining triggers and a responsible team from the start.

Tools and Frameworks at Every Phase

The table below provides a practical reference of the most widely used tools for each phase of the ML lifecycle. You do not need to master all of them — start with the essentials and expand your toolkit as needed.

  • 1 — Problem Definition. Essential: Confluence, Notion, Google Docs, Miro (for stakeholder workshops). Advanced/production: JIRA, Linear.
  • 2 — Data Collection. Essential: Pandas, SQLAlchemy, Requests. Advanced/production: Apache Spark, DVC (Data Version Control), Delta Lake.
  • 3 — EDA. Essential: Pandas, Matplotlib, Seaborn. Advanced/production: ydata-profiling, Sweetviz, Plotly.
  • 4 — Preprocessing. Essential: scikit-learn Pipeline, NumPy. Advanced/production: Feature-engine, category_encoders, Feast (feature store).
  • 5 — Model Training. Essential: scikit-learn, XGBoost, LightGBM. Advanced/production: PyTorch, TensorFlow, CatBoost.
  • 6 — Evaluation. Essential: scikit-learn metrics, MLflow Tracking. Advanced/production: Weights and Biases, Neptune.ai, Optuna.
  • 7 — Deployment. Essential: FastAPI, Docker, Joblib. Advanced/production: AWS SageMaker, GCP Vertex AI, Kubernetes.
  • 8 — Monitoring. Essential: Evidently AI, Prometheus, WhyLogs. Advanced/production: NannyML, Grafana, Alibi Detect.

Key Takeaways

  • The ML lifecycle has eight phases: Problem Definition, Data Collection, EDA, Preprocessing, Model Training, Evaluation, Deployment, and Monitoring — each with a defined deliverable.
  • Modelling is only 10 to 20 percent of a real ML project. The remaining effort is split across data, evaluation, deployment, and maintenance.
  • Problem Definition is the most impactful phase — a misframed problem wastes every hour spent on the phases that follow.
  • Data leakage is the most common and damaging source of overly optimistic evaluation results. Scikit-learn Pipelines are the standard defence against it.
  • Always establish a simple baseline model before progressing to complex algorithms. The baseline sets the bar and anchors the cost-benefit of complexity.
  • The test set must be used exactly once — at the very end. All model selection and tuning must be done using cross-validation on the training set alone.
  • The lifecycle is iterative, not linear. Findings at any phase frequently require returning to an earlier one.
  • Deployment without monitoring is incomplete. Model drift is inevitable, and monitoring plus retraining pipelines are what keep a system valuable over time.

What's Next?

With a solid understanding of the complete ML project lifecycle, you are ready to build the mathematical foundations that underpin every algorithm. In Chapter 2.1 — Linear Algebra for Machine Learning, we begin the core mathematics series: vectors, matrices, dot products, matrix decompositions, and how these concepts directly map to the operations inside every ML model from linear regression to neural networks.
