Handling Imbalanced Datasets: Resampling, SMOTE, and Class Weights

A dataset is imbalanced when one class has far more observations than another. This is common in classification problems such as fraud detection, loan default prediction, disease screening, customer churn, rare event prediction, and defect detection.

If imbalance is ignored, a model may appear accurate while failing to detect the minority class that matters most. Handling imbalance properly is essential for building useful, fair, and business-relevant classification models.

What is an Imbalanced Dataset?

A dataset is imbalanced when the target classes are not represented equally. For example, in a fraud detection dataset, 99% of transactions may be genuine and only 1% may be fraudulent.

In such cases, a model can achieve 99% accuracy simply by predicting every transaction as genuine. But this model is useless because it misses all fraud cases.

Core Idea: In imbalanced classification, the minority class is often the most important class. High accuracy can be misleading if the model fails to detect that class.

Class Imbalance at a Glance

[Figure: Visual intuition in three panels. Imbalanced Classes (majority vs. minority); SMOTE Creates Synthetic Points; Threshold Tuning (cutoffs 0.30 and 0.50).]

Why Accuracy Can Be Misleading

Accuracy measures the total percentage of correct predictions. In balanced datasets, this can be useful. But in imbalanced datasets, accuracy can hide poor minority-class performance.

Example: Fraud Detection

Suppose a dataset contains 10,000 transactions. Out of these, 9,900 are genuine and 100 are fraudulent.

Model Behaviour	Correct Predictions	Accuracy	Business Usefulness
Predicts all transactions as genuine	9,900 out of 10,000	99%	Very poor, because it catches zero fraud cases.

This is why accuracy alone should not be used for imbalanced classification problems.
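The arithmetic in the table can be checked with a few lines of Python. This is a minimal sketch using only the numbers from the example above; the label encoding (1 = fraud, 0 = genuine) and the variable names are assumptions made for illustration.

```python
# Labels for the example above: 1 = fraud, 0 = genuine (encoding is an assumption).
y_true = [0] * 9_900 + [1] * 100
# A "model" that predicts every transaction as genuine.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: fraud cases caught / all actual fraud cases.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2%}, fraud recall={fraud_recall:.2%}")
# prints: accuracy=99.00%, fraud recall=0.00%
```

The two numbers side by side show the problem: accuracy rewards the 9,900 easy majority predictions, while recall on the fraud class exposes that none of the 100 fraud cases were caught.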

Common Imbalanced Classification Problems

Problem	Majority Class	Minority Class	Why Minority Class Matters
Fraud Detection	Genuine transaction.	Fraud transaction.	Fraud cases create financial loss and risk.
Loan Default	Non-default.	Default.	Defaults are financially costly.
Disease Screening	No disease.	Disease present.	Missing positive cases can be dangerous.
Manufacturing Defect Detection	Normal product.	Defective product.	Defects affect quality and safety.
Customer Churn	No churn.	Churn.	Churners need retention action before leaving.

Main Ways to Handle Imbalance

There are several ways to handle imbalanced datasets. The best method depends on the dataset size, class ratio, model type, business cost of errors, and whether probability quality matters.

Method	What It Does	Best Used When	Main Risk
Random Undersampling	Reduces majority class examples.	Majority class is very large.	May remove useful information.
Random Oversampling	Duplicates minority class examples.	Minority class is small but reliable.	May overfit duplicated examples.
SMOTE (Synthetic Oversampling)	Creates synthetic minority examples.	Minority class needs expansion without direct duplication.	Can create unrealistic samples if used carelessly.
Class Weights	Gives higher penalty to minority class errors.	You do not want to alter the dataset distribution.	Can increase false positives if weight is too high.
Threshold Tuning	Changes the probability cutoff for positive prediction.	You need to control precision-recall trade-off.	Wrong threshold can harm business performance.

Random Undersampling

Random undersampling reduces the number of majority class examples. For example, if there are 100,000 genuine transactions and 1,000 fraud transactions, we may sample fewer genuine transactions to create a more balanced training dataset.
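A minimal sketch of random undersampling in plain Python, using the class counts from the example above. The function name `random_undersample`, its `ratio` parameter, and the integer row IDs are hypothetical and chosen only for illustration; in practice this would be applied to the training split only.

```python
import random

def random_undersample(rows, labels, majority_label, ratio=1.0, seed=0):
    """Keep every non-majority row; sample the majority down to ratio * minority count."""
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    n_keep = min(len(majority), int(ratio * len(minority)))
    # Sampling without replacement: each majority row is kept at most once.
    kept = rng.sample(majority, n_keep) + minority
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]

# 100,000 genuine (0) and 1,000 fraud (1) transactions, as in the example.
rows = list(range(101_000))            # stand-in row IDs
labels = [0] * 100_000 + [1] * 1_000
rows_res, labels_res = random_undersample(rows, labels, majority_label=0)

print(labels_res.count(0), labels_res.count(1))  # prints: 1000 1000
```

With `ratio=1.0` the result is fully balanced, which discards 99,000 genuine rows. A larger `ratio` (for example 5.0) keeps more majority information at the cost of a less balanced training set.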

Advantages

Simple and fast.
Reduces training time.
Useful when majority class is extremely large.
Can make minority patterns easier for the model to learn.

Limitations

May discard useful majority-class information.
Can make the model less stable.
May not represent the full majority-class diversity.
Must still be evaluated on validation data that keeps the original class distribution.

Random Oversampling

Random oversampling increases the number of minority class examples by duplicating existing minority observations. This gives the model more exposure to the minority class during training.
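Random oversampling can be sketched the same way. The helper below is a hypothetical illustration (the name `random_oversample` and the toy class counts are assumptions), and it balances the classes fully by drawing minority rows with replacement.

```python
import random

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    n_majority = len(labels) - len(minority)
    n_extra = max(0, n_majority - len(minority))
    # Sampling WITH replacement: the same minority row can be copied many times.
    extra = rng.choices(minority, k=n_extra)
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]

rows = list(range(1_100))              # stand-in row IDs
labels = [0] * 1_000 + [1] * 100       # toy data: 1,000 majority, 100 minority
rows_res, labels_res = random_oversample(rows, labels, minority_label=1)

print(labels_res.count(0), labels_res.count(1))  # prints: 1000 1000
distinct_minority = {r for r, y in zip(rows_res, labels_res) if y == 1}
print(len(distinct_minority))                    # prints: 100
```

The second print makes the main limitation concrete: the minority class now has 1,000 rows, but still only 100 distinct examples, so no new information has been added.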

Advantages

Simple to implement.
Does not remove majority-class data.
Can improve minority-class recall.
Useful when dataset is not too large.

Limitations

Duplicates the same minority examples.
Can increase overfitting.
Does not create new information.
May increase training time.

SMOTE: Synthetic Minority Oversampling Technique

SMOTE creates synthetic minority class examples instead of simply duplicating existing ones. It does this by looking at minority-class neighbors and creating new artificial points between them.

This can help the model learn a broader decision region for the minority class. However, SMOTE should be used carefully because synthetic examples may not always represent realistic business cases.

SMOTE = Create New Minority Samples Between Existing Minority Neighbors

SMOTE expands the minority class by interpolation rather than direct duplication.
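The interpolation step can be illustrated with a simplified sketch. This is not a full SMOTE implementation: it picks base points at random rather than looping over every minority example, it assumes purely numerical features, and the function name `smote_sketch` and the toy points are hypothetical. Each synthetic point is computed as base + lam * (neighbor - base) with lam drawn uniformly from [0, 1).

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points on segments between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority neighbors of the base point (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != base_idx]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Toy 2-D minority points (made up for illustration).
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
                   (1.1, 1.3), (0.9, 0.7), (1.4, 1.0)]
new_points = smote_sketch(minority_points, n_new=4, k=3)
for point in new_points:
    print(point)
```

Because every synthetic point lies on a line segment between two real minority points, it always falls inside the region the minority class already occupies. That is also why noisy or overlapping minority examples are a risk: interpolating between them spreads the noise.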

Use SMOTE When

Minority class has enough meaningful examples.
Simple duplication causes overfitting.
Feature space is mostly numerical or properly encoded.
You want to improve minority-class recall.

Be Careful When

Minority examples are very noisy.
Classes overlap strongly.
Categorical variables are encoded poorly.
Synthetic samples may be unrealistic.
SMOTE is applied before train-test split.

Class Weights

Class weights tell the model to treat mistakes on different classes differently. If the minority class is more important, the model can be given a higher penalty for misclassifying minority examples.

Class weights are useful because they do not change the actual dataset. Instead, they change how strongly the model responds to each class during training.

Example: Weighted Fraud Detection

If fraud cases are rare but costly, we may assign a higher weight to fraud examples. The model then pays more attention to correctly identifying fraud, even if fraud cases are fewer in number.

Class Weight Approach	How It Works	Possible Effect
Balanced Weights	Weights are automatically adjusted based on class frequency.	Minority class errors receive more penalty.
Manual Weights	User specifies class-specific costs.	Useful when business cost of errors is known.
Cost-Sensitive Learning	Model directly considers different misclassification costs.	Aligns model training with business risk.
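One common way to derive balanced weights is the inverse-frequency heuristic weight(c) = n_samples / (n_classes * count(c)), which is also the formula behind scikit-learn's `class_weight="balanced"` option. The sketch below computes it in plain Python for the 9,900 / 100 fraud example; the function name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count(class))."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

labels = [0] * 9_900 + [1] * 100       # 0 = genuine, 1 = fraud
weights = balanced_class_weights(labels)

print(weights[1])                      # prints: 50.0
print(round(weights[0], 4))            # prints: 0.5051

# Total weight carried by each class is now (approximately) equal:
total_genuine = weights[0] * 9_900     # about 5,000
total_fraud = weights[1] * 100         # 5,000
```

With these weights, one misclassified fraud case costs roughly as much as 99 misclassified genuine cases during training. Manual weights replace this frequency-based ratio with one derived from known business costs.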

Threshold Tuning

Many classifiers output probabilities. A default threshold of 0.5 is often used to convert probabilities into class labels. But in imbalanced problems, 0.5 may not be the best threshold.

Lowering the threshold can increase recall for the minority class, meaning the model catches more positives. However, it may also increase false positives. Raising the threshold can increase precision but may miss more true positives.
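The trade-off can be seen on a small example. The scores and labels below are made up for illustration, and `precision_recall_at` is a hypothetical helper; the sketch compares the default cutoff of 0.50 with a lower cutoff of 0.30.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up predicted probabilities and true labels (1 = positive).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.55, 0.60, 0.80, 0.90]
labels = [0,    0,    0,    1,    0,    1,    0,    1,    1,    1]

for threshold in (0.50, 0.30):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints:
# threshold=0.50 precision=0.75 recall=0.60
# threshold=0.30 precision=0.71 recall=1.00
```

Lowering the cutoff from 0.50 to 0.30 recovers the two positives scored 0.35 and 0.45, but also flags one more negative, so recall rises while precision falls.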

Threshold Change	Likely Effect	Business Example
Lower Threshold	More positives predicted, higher recall, more false positives.	Useful when missing fraud is very costly.
Higher Threshold	Fewer positives predicted, higher precision, more false negatives.	Useful when false alarms are very expensive.
Business-Optimized Threshold	Threshold chosen based on cost-benefit trade-off.	Used when each false positive and false negative has measurable cost.
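A business-optimized threshold can be sketched as a search over candidate cutoffs for the lowest total error cost on validation data. The scores, labels, candidate thresholds, and per-error costs below are all hypothetical values chosen for illustration.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of false positives and false negatives at a given threshold."""
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    return fp * cost_fp + fn * cost_fn

def best_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Candidate threshold with the lowest total cost (first one wins on ties)."""
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))

# Made-up validation scores and labels (1 = positive).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.55, 0.60, 0.80, 0.90]
labels = [0,    0,    0,    1,    0,    1,    0,    1,    1,    1]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

# Missing a positive is expensive (e.g. missed fraud): a low threshold wins.
print(best_threshold(scores, labels, candidates, cost_fp=10, cost_fn=200))  # prints: 0.3
# False alarms are expensive: a high threshold wins.
print(best_threshold(scores, labels, candidates, cost_fp=200, cost_fn=10))  # prints: 0.7
```

The same scores lead to different cutoffs once the costs change, which is the point of the third table row: the threshold is determined by the cost structure, not by the model alone.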

Safe Workflow: Avoiding Data Leakage

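Resampling methods must be applied carefully. A common mistake is applying oversampling, undersampling, or SMOTE before splitting the data. With oversampling and SMOTE, this leaks information from validation or test data into training. With any resampling method, it also means the evaluation data no longer reflects the real class distribution.

High-Risk Mistake: Never apply SMOTE or oversampling before the train-test split. Duplicated rows, or synthetic points interpolated from rows that later land in the test set, put near-copies of test examples into the training data, making performance look better than it really is.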

Leakage-Safe Imbalance Handling Workflow

Split Data First

→

Apply Resampling Only on Training Set

→

Train Model

→

Tune Threshold on Validation Set

→

Evaluate on Untouched Test Set
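The first two steps of this workflow can be sketched in plain Python. The helper `stratified_split`, the toy class counts, and the use of simple random oversampling are assumptions for illustration; a validation split for threshold tuning would be carved out of the training portion in the same way.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

labels = [0] * 990 + [1] * 10          # toy data with 1% minority class

# Step 1: split first, so the test set keeps the original 1% ratio.
train_idx, test_idx = stratified_split(labels)

# Step 2: resample ONLY the training indices (random oversampling here).
rng = random.Random(1)
train_minority = [i for i in train_idx if labels[i] == 1]
train_majority = [i for i in train_idx if labels[i] == 0]
n_extra = len(train_majority) - len(train_minority)
train_balanced = train_idx + rng.choices(train_minority, k=n_extra)

# No test row, and no copy of a test row, appears in the training data.
overlap = set(train_balanced) & set(test_idx)
print(len(overlap))                                   # prints: 0
print(sum(labels[i] for i in test_idx), len(test_idx))  # prints: 2 200
```

Reversing the two steps would let copies of the same minority row land on both sides of the split, which is exactly the leakage described above.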

Evaluation Metrics for Imbalanced Data

Imbalanced datasets require metrics that focus on minority-class detection and error trade-offs. Accuracy alone is usually not enough.

Metric	Meaning	Best Used When
Confusion Matrix	Shows true positives, false positives, true negatives, and false negatives.	You want to understand error types clearly.
Precision	Of predicted positives, how many were truly positive?	False positives are costly.
Recall	Of actual positives, how many did the model catch?	False negatives are costly.
F1 Score	Harmonic mean of precision and recall.	Both false positives and false negatives matter.
PR-AUC	Area under the precision-recall curve.	Positive class is rare and important.
ROC-AUC	Measures ranking ability across thresholds.	You want general class separation; it can look optimistic under heavy imbalance, so pair it with PR-AUC.
Balanced Accuracy	Average of recall across classes.	You want performance that accounts for both majority and minority classes.
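Several of these metrics follow directly from the four confusion-matrix counts. The sketch below uses hypothetical counts for a fraud model on 10,000 transactions with 100 fraud cases, where the model flags 150 transactions and 60 of them are truly fraud; the function name is also an assumption.

```python
def imbalance_metrics(tp, fp, fn, tn):
    """Common metrics computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # recall of the positive class
    specificity = tn / (tn + fp) if tn + fp else 0.0     # recall of the negative class
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Hypothetical counts: 60 fraud caught, 90 false alarms, 40 fraud missed.
m = imbalance_metrics(tp=60, fp=90, fn=40, tn=9_810)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
# prints:
# accuracy: 0.987
# precision: 0.400
# recall: 0.600
# f1: 0.480
# balanced_accuracy: 0.795
```

Accuracy stays near 99% even though 40% of fraud cases are missed and most alerts are false alarms, while precision, recall, F1, and balanced accuracy all reflect those weaknesses.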

Choosing the Right Strategy

Situation	Recommended Strategy	Reason
Very large majority class	Undersampling or class weights.	Reduces training burden or increases minority attention.
Small but reliable minority class	Oversampling or SMOTE.	Gives the model more minority examples to learn from.
High cost of false negatives	Lower threshold, class weights, recall-focused metric.	Catches more positive cases.
High cost of false positives	Higher threshold, precision-focused metric.	Reduces unnecessary positive alerts.
Need probability quality	Calibration check and original-distribution validation.	Resampling may affect probability calibration.
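To illustrate why resampling affects calibration: if the majority (negative) class is randomly undersampled so that only a fraction keep_rate of it remains, the odds of the positive class in the training data are inflated by a factor of 1 / keep_rate. Under the assumptions that negatives were dropped uniformly at random and that the model is well calibrated on the resampled data, a predicted probability p_s can be mapped back to the original distribution with p = keep_rate * p_s / (keep_rate * p_s - p_s + 1). The function name below is hypothetical, and this is a sketch of one possible correction rather than a replacement for a calibration check on original-distribution validation data.

```python
def correct_for_undersampling(p_sampled, keep_rate):
    """Map a probability learned on undersampled data back to the original prior.

    keep_rate: fraction of majority-class (negative) rows kept, in (0, 1].
    Assumes negatives were dropped uniformly at random.
    """
    return keep_rate * p_sampled / (keep_rate * p_sampled - p_sampled + 1)

# 100 positives and 9,900 negatives, undersampled to 100 negatives.
keep_rate = 100 / 9_900

# On the balanced training data, the base rate is 0.5.
# Mapped back, it should match the original 1% positive rate.
print(round(correct_for_undersampling(0.5, keep_rate), 4))   # prints: 0.01

# With no undersampling (keep_rate = 1), probabilities are unchanged.
print(correct_for_undersampling(0.3, 1.0))                   # prints: 0.3
```

The example shows the size of the distortion: a score of 0.5 from a model trained on balanced data corresponds to roughly a 1% probability in the real population, so thresholds and cost calculations should use the corrected or recalibrated values.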

Example: Fraud Detection

Business Problem

A payment company wants to detect fraudulent transactions. Only 0.5% of transactions are fraud. Missing fraud is costly, but too many false alarms can also frustrate customers.

Step	Action	Reason
1	Use stratified train-validation-test split.	Preserve fraud ratio across splits.
2	Apply class weights or SMOTE only on training data.	Improve minority learning without leaking test data.
3	Evaluate using recall, precision, F1, and PR-AUC.	Accuracy is misleading under heavy imbalance.
4	Tune threshold based on fraud investigation capacity.	Balance fraud detection with false alert workload.
5	Monitor model performance after deployment.	Fraud patterns may change over time.
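Step 4 can be sketched as choosing the threshold that flags only as many transactions as the investigation team can review. The scores, the capacity of 3 cases, and the helper name `threshold_for_capacity` are hypothetical; the sketch assumes distinct scores, since tied scores at the cutoff could flag more cases than the capacity.

```python
def threshold_for_capacity(scores, capacity):
    """Score of the capacity-th highest case; flag everything at or above it."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ranked = sorted(scores, reverse=True)
    # If capacity covers every case, the lowest score becomes the threshold.
    return ranked[min(capacity, len(ranked)) - 1]

# Made-up fraud scores for one batch of transactions.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.55, 0.60, 0.80, 0.90]

threshold = threshold_for_capacity(scores, capacity=3)
flagged = [s for s in scores if s >= threshold]

print(threshold)       # prints: 0.6
print(len(flagged))    # prints: 3
```

In practice the threshold would be chosen on validation data and then checked against precision and recall, since a capacity-driven cutoff says nothing by itself about how many fraud cases are missed.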

Example: Customer Churn Prediction

Retention Problem

A subscription company wants to identify customers likely to churn. Only 12% of customers churn in a month. If the model predicts everyone as non-churn, accuracy would be about 88%, but retention value would be zero because no at-risk customer is identified.

Class weights: Give more importance to churners during model training.
Threshold tuning: Lower threshold to identify more at-risk customers.
Precision-recall balance: Avoid offering discounts to too many customers who would not churn.
Business constraint: Choose threshold based on retention budget and contact capacity.

Example: Loan Default Prediction

Credit Risk Problem

A bank wants to predict default risk. Defaults are less frequent than successful repayments, but false negatives are costly because approving a risky borrower can create financial loss.

Class weights: Penalize default misclassification more heavily.
Threshold selection: Choose a risk threshold for approval, rejection, or manual review.
Metrics: Use recall for default class, precision, ROC-AUC, PR-AUC, and confusion matrix.
Governance: Check fairness, stability, and explainability before deployment.
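The threshold selection step for approval, rejection, or manual review can be sketched as a two-cutoff rule on the predicted default probability. The cutoff values 0.10 and 0.40 and the function name are hypothetical; real values would come from the bank's risk appetite and validation data.

```python
def credit_decision(p_default, approve_below=0.10, reject_above=0.40):
    """Three-way decision from a predicted default probability (cutoffs are illustrative)."""
    if p_default < approve_below:
        return "approve"
    if p_default > reject_above:
        return "reject"
    # Uncertain band between the two cutoffs goes to a human reviewer.
    return "manual review"

for p in (0.05, 0.25, 0.60):
    print(p, credit_decision(p))
# prints:
# 0.05 approve
# 0.25 manual review
# 0.6 reject
```

Using two cutoffs instead of one keeps automatic decisions for clear cases and routes borderline applicants to manual review, where the cost of a wrong automatic decision is highest.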

Common Mistakes in Handling Imbalanced Data

Mistake	Why It Is Harmful	Better Approach
Using accuracy alone	High accuracy can hide poor minority-class detection.	Use precision, recall, F1, PR-AUC, and confusion matrix.
Applying SMOTE before train-test split	Creates leakage and overestimates performance.	Split first, then apply SMOTE only on training data.
Balancing validation or test sets artificially	Evaluation no longer reflects real-world class distribution.	Keep validation and test sets close to real distribution.
Oversampling noisy minority examples	Model may learn noise as if it were signal.	Clean data and inspect minority cases before oversampling.
Ignoring threshold selection	Default 0.5 threshold may not match business cost.	Choose threshold using validation data and business trade-offs.
Assuming resampling fixes everything	Feature quality, model choice, and validation still matter.	Combine imbalance handling with good feature engineering and evaluation.

Best Practices for Imbalanced Classification

Imbalanced Dataset Checklist

Understand class ratio: Always check how many examples exist in each class.
Use stratified splitting: Preserve class distribution across train, validation, and test sets.
Do not rely on accuracy alone: Use recall, precision, F1, PR-AUC, ROC-AUC, and confusion matrix.
Apply resampling only on training data: Avoid leakage into validation or test data.
Try class weights first when suitable: They avoid changing the actual dataset distribution.
Use SMOTE carefully: Check whether synthetic samples make business sense.
Tune thresholds: Match the prediction cutoff to business costs and operational capacity.
Evaluate on original distribution: Validation and test data should reflect real-world class proportions.
Monitor after deployment: Minority-class patterns may drift over time.

Why Imbalance Handling is a Business Decision

Handling imbalance is not only a technical task. It is also a business decision because different errors have different costs. In fraud detection, missing fraud may be worse than investigating a false alert. In marketing, contacting too many low-risk customers may waste budget.

The right approach depends on what the business wants to optimize: catching more positives, reducing false alarms, improving ranking quality, protecting customer experience, or minimizing financial loss.

Practical Insight: The goal is not always to perfectly balance the dataset. The real goal is to build a model that makes better decisions under real-world class imbalance and business constraints.

Key Takeaways

Imbalanced datasets occur when one class is much more frequent than another.
Accuracy can be misleading because the model may ignore the minority class.
Important imbalance-handling methods include undersampling, oversampling, SMOTE, class weights, and threshold tuning.
SMOTE creates synthetic minority examples instead of simply duplicating existing ones.
Class weights penalize minority-class errors more heavily during training.
Threshold tuning controls the trade-off between precision and recall.
Resampling should be applied only to training data, after the train-test split and never before it.
Use metrics such as recall, precision, F1, PR-AUC, balanced accuracy, and confusion matrix.
The best strategy depends on business cost, class ratio, model type, and operational constraints.
