Handling Imbalanced Datasets: Resampling, SMOTE, and Class Weights
Class imbalance arises when one class has far more observations than another. It is very common in classification problems such as fraud detection, loan default prediction, disease screening, customer churn, rare event prediction, and defect detection.
If imbalance is ignored, a model may appear accurate while failing to detect the minority class that matters most. Handling imbalance properly is essential for building useful, fair, and business-relevant classification models.
What is an Imbalanced Dataset?
A dataset is imbalanced when the target classes are not represented equally. For example, in a fraud detection dataset, 99% of transactions may be genuine and only 1% may be fraudulent.
In such cases, a model can achieve 99% accuracy simply by predicting every transaction as genuine. But this model is useless because it misses all fraud cases.
Core Idea: In imbalanced classification, the minority class is often the most important class. High accuracy can be misleading if the model fails to detect that class.
Why Accuracy Can Be Misleading
Accuracy measures the total percentage of correct predictions. In balanced datasets, this can be useful. But in imbalanced datasets, accuracy can hide poor minority-class performance.
Example: Fraud Detection
Suppose a dataset contains 10,000 transactions. Out of these, 9,900 are genuine and 100 are fraudulent.
| Model Behaviour | Correct Predictions | Accuracy | Business Usefulness |
|---|---|---|---|
| Predicts all transactions as genuine | 9,900 out of 10,000 | 99% | Very poor, because it catches zero fraud cases. |
This is why accuracy alone should not be used for imbalanced classification problems.
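The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the label encoding (1 = fraud, 0 = genuine) and the variable names are assumptions made for illustration.

```python
# Labels from the example above: 1 = fraud, 0 = genuine (encoding is an assumption).
y_true = [0] * 9_900 + [1] * 100

# A "model" that predicts every transaction as genuine.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: frauds caught / all actual frauds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / sum(y_true)

print(f"accuracy     = {accuracy:.2%}")      # 99.00%
print(f"fraud recall = {fraud_recall:.2%}")  # 0.00%
```

The two numbers side by side show the problem: accuracy rewards the model for the 9,900 easy cases, while recall on the fraud class exposes that nothing useful was learned.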
Common Imbalanced Classification Problems
| Problem | Majority Class | Minority Class | Why Minority Class Matters |
|---|---|---|---|
| Fraud Detection | Genuine transaction. | Fraud transaction. | Fraud cases create financial loss and risk. |
| Loan Default | Non-default. | Default. | Defaults are financially costly. |
| Disease Screening | No disease. | Disease present. | Missing positive cases can be dangerous. |
| Manufacturing Defect Detection | Normal product. | Defective product. | Defects affect quality and safety. |
| Customer Churn | No churn. | Churn. | Churners need retention action before leaving. |
Main Ways to Handle Imbalance
There are several ways to handle imbalanced datasets. The best method depends on the dataset size, class ratio, model type, business cost of errors, and whether probability quality matters.
| Method | What It Does | Best Used When | Main Risk |
|---|---|---|---|
| Random Undersampling (resampling) | Reduces majority class examples. | Majority class is very large. | May remove useful information. |
| Random Oversampling (resampling) | Duplicates minority class examples. | Minority class is small but reliable. | May overfit duplicated examples. |
| SMOTE (synthetic oversampling) | Creates synthetic minority examples. | Minority class needs expansion without direct duplication. | Can create unrealistic samples if used carelessly. |
| Class Weights | Gives higher penalty to minority class errors. | You do not want to alter the dataset distribution. | Can increase false positives if weight is too high. |
| Threshold Tuning | Changes the probability cutoff for positive prediction. | You need to control precision-recall trade-off. | Wrong threshold can harm business performance. |
Random Undersampling
Random undersampling reduces the number of majority class examples. For example, if there are 100,000 genuine transactions and 1,000 fraud transactions, we may sample fewer genuine transactions to create a more balanced training dataset.
Advantages:
- Simple and fast.
- Reduces training time.
- Useful when majority class is extremely large.
- Can make minority patterns easier for the model to learn.
Limitations:
- May discard useful majority-class information.
- Can make the model less stable.
- May not represent the full majority-class diversity.
- Should be tested using validation data with original class distribution.
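Random undersampling can be sketched in a few lines of standard-library Python. The function name `random_undersample`, its `ratio` parameter, and the toy data are hypothetical and chosen only for illustration; a library such as imbalanced-learn would normally be used in practice.

```python
import random
from collections import Counter

def random_undersample(rows, labels, majority_label, ratio=1.0, seed=0):
    """Keep every minority row plus a random subset of majority rows.

    ratio = majority rows kept per minority row (1.0 gives a 50/50 result).
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]
    n_keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    kept_majority = rng.sample(majority_idx, k=n_keep)  # sampling without replacement
    keep = sorted(kept_majority + minority_idx)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy training set mirroring the example: 100,000 genuine (0), 1,000 fraud (1).
labels = [0] * 100_000 + [1] * 1_000
rows = list(range(len(labels)))  # stand-in for feature rows

X_under, y_under = random_undersample(rows, labels, majority_label=0)
print(Counter(y_under))
```

Only the training portion should be passed to a function like this; validation and test data keep their original class distribution.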
Random Oversampling
Random oversampling increases the number of minority class examples by duplicating existing minority observations. This gives the model more exposure to the minority class during training.
Advantages:
- Simple to implement.
- Does not remove majority-class data.
- Can improve minority-class recall.
- Useful when dataset is not too large.
Limitations:
- Duplicates the same minority examples.
- Can increase overfitting.
- Does not create new information.
- May increase training time.
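A minimal sketch of random oversampling follows, again using only the standard library. The helper name `random_oversample` and the toy class sizes are assumptions for illustration.

```python
import random
from collections import Counter

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    n_majority = len(labels) - len(minority_idx)
    n_extra = max(0, n_majority - len(minority_idx))
    extra = rng.choices(minority_idx, k=n_extra)  # sampling with replacement
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Toy training set: 1,000 majority rows (0) and 50 minority rows (1).
labels = [0] * 1_000 + [1] * 50
rows = list(range(len(labels)))  # stand-in for feature rows

X_over, y_over = random_oversample(rows, labels, minority_label=1)
print(Counter(y_over))

# The number of *distinct* minority rows is unchanged: no new information was created.
distinct_minority = {x for x, y in zip(X_over, y_over) if y == 1}
print(len(distinct_minority))
```

The final check makes the main limitation concrete: the class counts are balanced, but the model still sees the same 50 underlying minority observations.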
SMOTE: Synthetic Minority Oversampling Technique
SMOTE creates synthetic minority class examples instead of simply duplicating existing ones. It does this by looking at minority-class neighbors and creating new artificial points between them.
This can help the model learn a broader decision region for the minority class. However, SMOTE should be used carefully because synthetic examples may not always represent realistic business cases.
SMOTE is useful when:
- Minority class has enough meaningful examples.
- Simple duplication causes overfitting.
- Feature space is mostly numerical or properly encoded.
- You want to improve minority-class recall.
Use SMOTE with caution when:
- Minority examples are very noisy.
- Classes overlap strongly.
- Categorical variables are encoded poorly.
- Synthetic samples may be unrealistic.
- SMOTE is applied before the train-test split (this causes leakage).
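The interpolation idea behind SMOTE can be illustrated with a simplified sketch. This is not the imbalanced-learn implementation; the function name `smote_sketch`, the choice of `k`, and the five toy points are assumptions, and the code handles numeric features only.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic points on the line segment between a minority point
    and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base_i = rng.randrange(len(minority))
        base = minority[base_i]
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != base_i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Five hypothetical minority-class points with two numeric features.
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5), (1.8, 1.5)]
new_points = smote_sketch(minority_points, n_synthetic=10)
for point in new_points[:3]:
    print(point)
```

Because every synthetic point lies between two existing minority points, noisy or overlapping minority examples produce synthetic points in the wrong region, which is why the cautions above matter.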
Class Weights
Class weights tell the model to treat mistakes on different classes differently. If the minority class is more important, the model can be given a higher penalty for misclassifying minority examples.
Class weights are useful because they do not change the actual dataset. Instead, they change how strongly the model responds to each class during training.
Example: Weighted Fraud Detection
If fraud cases are rare but costly, we may assign a higher weight to fraud examples. The model then pays more attention to correctly identifying fraud, even if fraud cases are fewer in number.
| Class Weight Approach | How It Works | Possible Effect |
|---|---|---|
| Balanced Weights | Weights are automatically adjusted based on class frequency. | Minority class errors receive more penalty. |
| Manual Weights | User specifies class-specific costs. | Useful when business cost of errors is known. |
| Cost-Sensitive Learning | Model directly considers different misclassification costs. | Aligns model training with business risk. |
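One common way to compute balanced weights is the inverse-frequency heuristic, weight = n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight="balanced"`. The sketch below applies it with the standard library only; the function name and the 9,900 / 100 split (reused from the earlier fraud example) are illustrative assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * n_samples_in_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# 9,900 genuine (0) and 100 fraud (1) training labels.
labels = [0] * 9_900 + [1] * 100
weights = balanced_class_weights(labels)
print(weights)  # fraud weight is 50.0, genuine weight is about 0.505

# With these weights, each class contributes the same total weight to the loss.
total_weight = {cls: weights[cls] * n for cls, n in Counter(labels).items()}
print(total_weight)
```

Manual weights follow the same mechanics, except that the dictionary is filled in from known business costs rather than from class frequencies.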
Threshold Tuning
Many classifiers output probabilities. A default threshold of 0.5 is often used to convert probabilities into class labels. But in imbalanced problems, 0.5 may not be the best threshold.
Lowering the threshold can increase recall for the minority class, meaning the model catches more positives. However, it may also increase false positives. Raising the threshold can increase precision but may miss more true positives.
| Threshold Change | Likely Effect | Business Example |
|---|---|---|
| Lower Threshold | More positives predicted, higher recall, more false positives. | Useful when missing fraud is very costly. |
| Higher Threshold | Fewer positives predicted, higher precision, more false negatives. | Useful when false alarms are very expensive. |
| Business-Optimized Threshold | Threshold chosen based on cost-benefit trade-off. | Used when each false positive and false negative has measurable cost. |
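A business-optimized threshold can be found by scanning candidate cutoffs on validation data and picking the one with the lowest total cost. The sketch below assumes hypothetical validation scores and hypothetical per-error costs; the function names are illustrative.

```python
def total_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Cost of all false positives and false negatives at a given cutoff."""
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < threshold)
    return fp * cost_fp + fn * cost_fn

def best_threshold(y_true, y_prob, cost_fp, cost_fn, candidates=None):
    """Return the lowest-cost cutoff among the candidates (first one on ties)."""
    candidates = candidates or [i / 100 for i in range(1, 100)]
    return min(candidates, key=lambda th: total_cost(y_true, y_prob, th, cost_fp, cost_fn))

# Hypothetical validation labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.60, 0.90]

# Missing a positive is 20x worse than a false alarm: the cutoff moves down.
th_fn_costly = best_threshold(y_true, y_prob, cost_fp=1, cost_fn=20)

# A false alarm is 20x worse than a miss: the cutoff moves up.
th_fp_costly = best_threshold(y_true, y_prob, cost_fp=20, cost_fn=1)

print(th_fn_costly, th_fp_costly)
```

The threshold should be chosen on validation data and then confirmed on a separate test set, since a cutoff tuned and evaluated on the same data tends to look better than it is.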
Safe Workflow: Avoiding Data Leakage
Resampling methods must be applied carefully. A common mistake is applying oversampling, undersampling, or SMOTE before splitting the data. This can leak information from validation or test data into training.
High-Risk Mistake: Never apply SMOTE or oversampling before the train-test split. Duplicated or synthetic examples derived from records that later land in the test set carry those records' patterns into training, making performance look better than it really is.
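Leakage-Safe Imbalance Handling Workflow: split the data first using a stratified split, apply resampling, SMOTE, or class weights to the training set only, tune the threshold on validation data, and report final metrics on a test set that keeps the original class distribution.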
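The order of operations in a leakage-safe workflow can be sketched as follows: split first, resample the training indices only, and leave the held-out data untouched. The helper names `stratified_split` and `oversample_minority` are hypothetical, and the sketch uses only the standard library rather than a real pipeline.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def oversample_minority(idx, labels, minority_label, seed=0):
    """Duplicate minority indices (drawn only from idx) until classes are equal."""
    rng = random.Random(seed)
    minority = [i for i in idx if labels[i] == minority_label]
    n_extra = (len(idx) - len(minority)) - len(minority)
    return idx + rng.choices(minority, k=max(0, n_extra))

labels = [0] * 9_900 + [1] * 100  # 1% positive class

# Step 1: split BEFORE any resampling.
train_idx, test_idx = stratified_split(labels)

# Step 2: resample the training indices only.
train_idx_balanced = oversample_minority(train_idx, labels, minority_label=1)

# Step 3: the test set is untouched and keeps the original 1% ratio.
print(Counter(labels[i] for i in train_idx_balanced))
print(Counter(labels[i] for i in test_idx))
print(set(train_idx_balanced) & set(test_idx))  # empty set: no test row reached training
```

Because the duplicates are drawn from training indices alone, no copy of a test row can appear in training, which is exactly what resampling before the split would fail to guarantee.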
Evaluation Metrics for Imbalanced Data
Imbalanced datasets require metrics that focus on minority-class detection and error trade-offs. Accuracy alone is usually not enough.
| Metric | Meaning | Best Used When |
|---|---|---|
| Confusion Matrix | Shows true positives, false positives, true negatives, and false negatives. | You want to understand error types clearly. |
| Precision | Of predicted positives, how many were truly positive? | False positives are costly. |
| Recall | Of actual positives, how many did the model catch? | False negatives are costly. |
| F1 Score | Harmonic mean of precision and recall. | Both false positives and false negatives matter. |
| PR-AUC | Area under the precision-recall curve. | Positive class is rare and important. |
| ROC-AUC | Measures ranking ability across thresholds. | You want general class separation; interpret carefully under heavy imbalance, where it can look optimistic because true negatives dominate. |
| Balanced Accuracy | Average of recall across classes. | You want performance that accounts for both majority and minority classes. |
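Several of the threshold-dependent metrics in this table follow directly from the four confusion-matrix counts. The sketch below computes them with the standard library; the counts are hypothetical and chosen to resemble a 1% fraud rate, and the function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # minority-class recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # majority-class recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)              # harmonic mean
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Hypothetical results on 10,000 transactions, 100 of them fraudulent.
metrics = classification_metrics(tp=80, fp=120, fn=20, tn=9_780)
for name, value in metrics.items():
    print(f"{name:>17}: {value:.3f}")
```

In this hypothetical case accuracy is 0.986 while precision is only 0.400, which illustrates why the table recommends looking beyond a single number. PR-AUC and ROC-AUC are not included here because they require predicted scores across all thresholds rather than one confusion matrix.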
Choosing the Right Strategy
| Situation | Recommended Strategy | Reason |
|---|---|---|
| Very large majority class | Undersampling or class weights. | Reduces training burden or increases minority attention. |
| Small but reliable minority class | Oversampling or SMOTE. | Gives the model more minority examples to learn from. |
| High cost of false negatives | Lower threshold, class weights, recall-focused metric. | Catches more positive cases. |
| High cost of false positives | Higher threshold, precision-focused metric. | Reduces unnecessary positive alerts. |
| Need probability quality | Calibration check and original-distribution validation. | Resampling may affect probability calibration. |
Example: Fraud Detection
Business Problem
A payment company wants to detect fraudulent transactions. Only 0.5% of transactions are fraud. Missing fraud is costly, but too many false alarms can also frustrate customers.
| Step | Action | Reason |
|---|---|---|
| 1 | Use stratified train-validation-test split. | Preserve fraud ratio across splits. |
| 2 | Apply class weights or SMOTE only on training data. | Improve minority learning without leaking test data. |
| 3 | Evaluate using recall, precision, F1, and PR-AUC. | Accuracy is misleading under heavy imbalance. |
| 4 | Tune threshold based on fraud investigation capacity. | Balance fraud detection with false alert workload. |
| 5 | Monitor model performance after deployment. | Fraud patterns may change over time. |
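Step 4 can be made concrete with a capacity-based cutoff: if investigators can review only a fixed number of alerts, the threshold is the score of the last case that fits. The function name, the capacity of 3, and the scores below are hypothetical, and the sketch assumes no tied scores at the cutoff.

```python
def capacity_threshold(scores, capacity):
    """Score cutoff that flags the `capacity` highest-scoring cases.

    A case is flagged when score >= cutoff. If several cases tie exactly at
    the cutoff, more than `capacity` cases may be flagged.
    """
    if capacity <= 0 or not scores:
        return float("inf")  # flag nothing
    ranked = sorted(scores, reverse=True)
    return ranked[min(capacity, len(ranked)) - 1]

# Hypothetical fraud scores for one batch of transactions.
scores = [0.91, 0.15, 0.67, 0.03, 0.42, 0.88, 0.09, 0.55]

# Assume the investigation team can handle 3 alerts per batch.
cutoff = capacity_threshold(scores, capacity=3)
flagged = [s for s in scores if s >= cutoff]

print(cutoff)   # 0.67
print(flagged)  # [0.91, 0.67, 0.88]
```

A cutoff derived this way should be recomputed as score distributions shift, which connects to the monitoring step in the table.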
Example: Customer Churn Prediction
Retention Problem
A subscription company wants to identify customers likely to churn. Only 12% of customers churn in a given month. A model that predicts every customer as non-churn would reach 88% accuracy yet identify no churners, so it would deliver no retention value.
- Class weights: Give more importance to churners during model training.
- Threshold tuning: Lower threshold to identify more at-risk customers.
- Precision-recall balance: Avoid offering discounts to too many customers who would not churn.
- Business constraint: Choose threshold based on retention budget and contact capacity.
Example: Loan Default Prediction
Credit Risk Problem
A bank wants to predict default risk. Defaults are less frequent than successful repayments, but false negatives are costly because approving a risky borrower can create financial loss.
- Class weights: Penalize default misclassification more heavily.
- Threshold selection: Choose a risk threshold for approval, rejection, or manual review.
- Metrics: Use recall for default class, precision, ROC-AUC, PR-AUC, and confusion matrix.
- Governance: Check fairness, stability, and explainability before deployment.
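The threshold-selection point can be sketched as a three-way decision rule with two cutoffs. The cutoff values 0.10 and 0.40 and the function name are purely hypothetical; real cutoffs would come from validation data, cost analysis, and credit policy.

```python
def credit_decision(default_probability, approve_below=0.10, reject_above=0.40):
    """Map a predicted default probability to approve / manual review / reject.

    Both cutoffs are illustrative placeholders, not recommended values.
    """
    if default_probability < approve_below:
        return "approve"
    if default_probability > reject_above:
        return "reject"
    return "manual review"

# Hypothetical applicants with predicted default probabilities.
for probability in (0.03, 0.25, 0.62):
    print(probability, credit_decision(probability))
```

Using a review band between the two cutoffs lets the bank spend human attention on borderline cases instead of forcing a single threshold to make every decision.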
Common Mistakes in Handling Imbalanced Data
| Mistake | Why It Is Harmful | Better Approach |
|---|---|---|
| Using accuracy alone | High accuracy can hide poor minority-class detection. | Use precision, recall, F1, PR-AUC, and confusion matrix. |
| Applying SMOTE before train-test split | Creates leakage and overestimates performance. | Split first, then apply SMOTE only on training data. |
| Balancing validation or test sets artificially | Evaluation no longer reflects real-world class distribution. | Keep validation and test sets close to real distribution. |
| Oversampling noisy minority examples | Model may learn noise as if it were signal. | Clean data and inspect minority cases before oversampling. |
| Ignoring threshold selection | Default 0.5 threshold may not match business cost. | Choose threshold using validation data and business trade-offs. |
| Assuming resampling fixes everything | Feature quality, model choice, and validation still matter. | Combine imbalance handling with good feature engineering and evaluation. |
Best Practices for Imbalanced Classification
Imbalanced Dataset Checklist
- Understand class ratio: Always check how many examples exist in each class.
- Use stratified splitting: Preserve class distribution across train, validation, and test sets.
- Do not rely on accuracy alone: Use recall, precision, F1, PR-AUC, ROC-AUC, and confusion matrix.
- Apply resampling only on training data: Avoid leakage into validation or test data.
- Try class weights first when suitable: They avoid changing the actual dataset distribution.
- Use SMOTE carefully: Check whether synthetic samples make business sense.
- Tune thresholds: Match the prediction cutoff to business costs and operational capacity.
- Evaluate on original distribution: Validation and test data should reflect real-world class proportions.
- Monitor after deployment: Minority-class patterns may drift over time.
Why Imbalance Handling is a Business Decision
Handling imbalance is not only a technical task. It is also a business decision because different errors have different costs. In fraud detection, missing fraud may be worse than investigating a false alert. In marketing, contacting too many low-risk customers may waste budget.
The right approach depends on what the business wants to optimize: catching more positives, reducing false alarms, improving ranking quality, protecting customer experience, or minimizing financial loss.
Practical Insight: The goal is not always to perfectly balance the dataset. The real goal is to build a model that makes better decisions under real-world class imbalance and business constraints.
Key Takeaways
- Imbalanced datasets occur when one class is much more frequent than another.
- Accuracy can be misleading because the model may ignore the minority class.
- Important imbalance-handling methods include undersampling, oversampling, SMOTE, class weights, and threshold tuning.
- SMOTE creates synthetic minority examples instead of simply duplicating existing ones.
- Class weights penalize minority-class errors more heavily during training.
- Threshold tuning controls the trade-off between precision and recall.
- Resampling should be applied only to the training data, after the train-test split and never before it.
- Use metrics such as recall, precision, F1, PR-AUC, balanced accuracy, and confusion matrix.
- The best strategy depends on business cost, class ratio, model type, and operational constraints.