Tree-Based Models: Decision Trees and Random Forest
Tree-based models are powerful machine learning algorithms that make predictions by splitting data into smaller and smaller groups. They are popular because they can capture non-linear relationships, handle interactions between features, and work well for both regression and classification problems.
The two most important tree-based models to understand first are Decision Trees and Random Forests. A decision tree is a single tree of rules, while a random forest combines many trees to create a stronger and more stable model.
What are Tree-Based Models?
Tree-based models use a series of decision rules to make predictions. Each rule splits the data based on a feature value. The model keeps splitting until it reaches final groups, called leaf nodes, where predictions are made.
For example, a customer churn tree may first ask whether customer tenure is less than 6 months. Then it may ask whether monthly charges are high. Then it may ask whether the customer has raised support tickets. Based on these answers, the model predicts churn risk.
Core Idea: Tree-based models divide data into meaningful segments using decision rules, then make predictions based on the behaviour of observations inside each segment.
Decision Tree Intuition
A decision tree works like a flowchart. At every step, it asks a question about the data. Based on the answer, the observation moves to the left or right branch. Eventually, it reaches a leaf node where the prediction is made.
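The flowchart idea can be written as nested conditions. The sketch below hand-codes the churn example described earlier. The function name, the thresholds (6 months, ₹1,000, 3 tickets), and the risk labels are illustrative assumptions, not values learned from data.

```python
def predict_churn_risk(tenure_months, monthly_charges, support_tickets):
    """Walk a hand-written tree; every threshold here is hypothetical."""
    if tenure_months < 6:                 # root node
        if monthly_charges > 1000:        # internal node
            if support_tickets >= 3:      # internal node
                return "high"             # leaf
            return "medium"               # leaf
        return "medium"                   # leaf
    if support_tickets >= 3:              # internal node on the other branch
        return "medium"                   # leaf
    return "low"                          # leaf


print(predict_churn_risk(tenure_months=3, monthly_charges=1500, support_tickets=4))
print(predict_churn_risk(tenure_months=24, monthly_charges=500, support_tickets=0))
```

A trained tree has the same shape, except that the algorithm chooses the questions and thresholds from the training data.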
Visual Idea of Tree-Based Models
Key Parts of a Decision Tree
| Tree Component | Meaning | Example |
|---|---|---|
| Root Node | The first split in the tree. | Customer tenure < 6 months? |
| Internal Node | A decision point inside the tree. | Monthly charges > ₹1,000? |
| Branch | The path created by a decision. | Yes branch or No branch. |
| Leaf Node | The final node where prediction is made. | Predict high churn risk. |
| Depth | The number of levels in the tree. | A depth of 4 means four levels of decisions. |
| Split | A rule that divides data into groups. | Age < 30, income > ₹50,000, city = Delhi. |
How Decision Trees Make Splits
A decision tree chooses splits that make the resulting groups more pure or more useful for prediction. In classification, purity means that a group mostly contains one class. In regression, the tree tries to reduce variation in the target value within each group.
| Problem Type | Common Split Criteria | Goal |
|---|---|---|
| Classification (classification tree) | Gini impurity, entropy, information gain. | Create groups that are mostly one class. |
| Regression (regression tree) | Mean squared error, mean absolute error. | Create groups with similar target values. |
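To make the classification criteria concrete, here is a minimal sketch of Gini impurity, entropy, and information gain for one candidate split. The helper names and the tiny churn/stay label lists are made up for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def weighted_impurity(left, right, impurity=gini):
    """Impurity after a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Hypothetical node with 5 churners and 5 stayers, and one candidate split.
parent = ["churn"] * 5 + ["stay"] * 5
left = ["churn"] * 4 + ["stay"] * 1
right = ["churn"] * 1 + ["stay"] * 4

print("Gini before split:", round(gini(parent), 3))                  # 0.5
print("Gini after split: ", round(weighted_impurity(left, right), 3))  # 0.32
info_gain = entropy(parent) - weighted_impurity(left, right, impurity=entropy)
print("Information gain: ", round(info_gain, 3))                     # about 0.278
```

The tree evaluates many candidate splits this way and keeps the one with the lowest impurity after the split (equivalently, the highest gain).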
Decision Tree for Classification
In classification, a decision tree predicts a category or class. Examples include churn or no churn, fraud or not fraud, default or no default, and approved or rejected.
The tree creates splits that separate classes as clearly as possible. At the final leaf node, the predicted class is usually the majority class in that leaf.
Decision Tree for Regression
In regression, a decision tree predicts a continuous numerical value. Examples include house price, sales revenue, customer spend, demand, delivery time, or insurance claim amount.
The tree creates groups where the target values are similar. At the final leaf node, the prediction is usually the average target value of training observations in that leaf.
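The following sketch shows how a regression tree could pick one split on a single feature: it tries each midpoint between neighbouring values and keeps the one with the lowest total squared error, which is equivalent to minimising the size-weighted mean squared error. The data and helper names are invented for illustration.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors around the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_threshold(xs, ys):
    """Return (threshold, cost, left_prediction, right_prediction)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            # Each leaf predicts the mean target of its training rows.
            best = (threshold, cost, mean(left), mean(right))
    return best


# Hypothetical houses: area in square metres, price in arbitrary units.
area = [50, 60, 70, 120, 130, 140]
price = [30, 32, 34, 80, 82, 84]

threshold, cost, left_pred, right_pred = best_threshold(area, price)
print(f"Split at area < {threshold}: predict {left_pred} on the left, {right_pred} on the right")
```

Here the best split separates the three small houses from the three large ones, and each leaf predicts the average price of its group.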
Advantages of Decision Trees
Limitations of Decision Trees
Decision trees are easy to understand, but they can overfit. A deep tree may memorize training data instead of learning general patterns. This can lead to excellent training performance but poor test performance.
- Can overfit if allowed to grow too deep.
- Small data changes can create different trees.
- Single trees may be unstable.
- Predictions may be less smooth in regression problems.
Ways to reduce overfitting:
- Limit maximum depth.
- Set minimum samples per leaf.
- Set minimum samples required to split.
- Use pruning.
- Use Random Forest instead of a single tree.
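The overfitting pattern can be checked on synthetic data. The sketch below assumes scikit-learn is installed and uses `make_classification` with deliberately noisy labels; the dataset and the specific limits (`max_depth=3`, `min_samples_leaf=20`) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped, so there is noise to memorize.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree keeps splitting until every leaf is pure.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A constrained tree: limited depth and a minimum leaf size.
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=20, random_state=0
).fit(X_train, y_train)

for name, model in [("deep", deep_tree), ("small", small_tree)]:
    print(
        f"{name:>5}: depth={model.get_depth()}, "
        f"train accuracy={model.score(X_train, y_train):.3f}, "
        f"test accuracy={model.score(X_test, y_test):.3f}"
    )
```

The unrestricted tree fits the training set perfectly, so the gap between its training and test accuracy is the quantity to watch.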
Important Decision Tree Hyperparameters
| Hyperparameter | Meaning | Effect |
|---|---|---|
| max_depth | Maximum depth of the tree. | Lower depth reduces overfitting but may underfit if too low. |
| min_samples_split | Minimum observations needed to split a node. | Higher values make the tree simpler. |
| min_samples_leaf | Minimum observations required in a leaf node. | Prevents tiny leaves that memorize training data. |
| max_features | Maximum number of features considered at each split. | Lower values add randomness, which is mainly useful in ensemble methods. |
| criterion | Split quality measure. | Examples include Gini, entropy, and squared error. |
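The hyperparameter names in this table match scikit-learn's `DecisionTreeClassifier`. As a minimal sketch, assuming scikit-learn and its bundled iris dataset, the code below sets them explicitly and prints the learned rules with `export_text`. The chosen values are illustrative, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

tree = DecisionTreeClassifier(
    criterion="gini",       # split quality measure
    max_depth=2,            # at most two levels of decisions
    min_samples_split=10,   # a node needs 10 rows before it may split
    min_samples_leaf=5,     # every leaf keeps at least 5 rows
    max_features=None,      # consider all features at each split
    random_state=0,
)
tree.fit(iris.data, iris.target)

# Print the tree as indented if/else rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

A small `max_depth` keeps the printed rules short enough to read, which is the main reason to use a single tree for explanation.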
What is Random Forest?
Random Forest is an ensemble model that combines many decision trees. Instead of relying on one tree, it builds multiple trees on different samples of the data and combines their predictions.
For classification, a random forest usually predicts by majority vote. For regression, it predicts by averaging the predictions of all trees.
Core Idea: A random forest reduces the instability of a single decision tree by averaging the decisions of many different trees.
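Combining tree outputs is simple once each tree has made its own prediction. The sketch below uses made-up per-tree predictions and hypothetical helper names. Note that some libraries, scikit-learn among them, average the trees' predicted class probabilities rather than counting hard votes, so treat this as the textbook version.

```python
from collections import Counter
from statistics import mean


def majority_vote(class_predictions):
    """Classification: return the class predicted by the most trees."""
    return Counter(class_predictions).most_common(1)[0][0]


def average_prediction(value_predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(value_predictions)


# Hypothetical outputs from five classification trees for one customer.
votes = ["churn", "stay", "churn", "churn", "stay"]
print(majority_vote(votes))           # churn

# Hypothetical outputs from three regression trees for one house.
estimates = [210.0, 190.0, 200.0]
print(average_prediction(estimates))  # 200.0
```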
How Random Forest Works
Random Forest uses two important ideas: bagging and random feature selection. Bagging means each tree is trained on a bootstrap sample, that is, a random sample of the training data drawn with replacement. Random feature selection means each split considers only a random subset of features.
Random Forest Training Process
Bagging Explained
Bagging stands for bootstrap aggregating. It means drawing multiple random samples from the training data with replacement, training a separate model on each sample, and then aggregating their predictions.
Because each tree sees a slightly different version of the training data, the trees become different from each other. Combining them reduces variance and makes the final model more stable.
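A bootstrap sample can be sketched with the standard library alone. In the code below, `bootstrap_indices` is a hypothetical helper that draws row indices with replacement; rows that a given tree never sees are its out-of-bag rows.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(42)  # fixed seed so the run is repeatable
n_rows = 10

# One bootstrap sample per tree; three trees here for brevity.
samples = [bootstrap_indices(n_rows, rng) for _ in range(3)]

for tree_id, idx in enumerate(samples):
    out_of_bag = sorted(set(range(n_rows)) - set(idx))
    print(f"tree {tree_id}: rows used {sorted(idx)}, out-of-bag rows {out_of_bag}")
```

Some rows appear more than once and others not at all, which is why each tree ends up slightly different. The out-of-bag rows can serve as a built-in validation set for that tree.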
Decision Tree vs Random Forest
| Aspect | Decision Tree | Random Forest |
|---|---|---|
| Model Structure | Single tree. | Many trees combined. |
| Interpretability | High, especially if the tree is small. | Lower than a single tree, but feature importance helps. |
| Overfitting Risk | High if tree is deep. | Lower because predictions are averaged across many trees. |
| Performance | Can be good but unstable. | Usually stronger and more stable. |
| Training Time | Fast. | Slower because many trees are trained. |
| Best Use | Explainable rule-based modelling and simple baselines. | Stronger predictive performance on tabular data. |
Advantages of Random Forest
Limitations of Random Forest
- Less interpretable than a single decision tree.
- Can be slower with many trees and large datasets.
- Feature importance can be biased toward certain variable types.
- May not extrapolate well beyond the range of training data in regression.
Ways to manage these limitations:
- Use feature importance and partial dependence plots.
- Tune number of trees and tree depth.
- Use cross-validation for reliable evaluation.
- Compare with simpler models when interpretability matters.
Important Random Forest Hyperparameters
| Hyperparameter | Meaning | Practical Impact |
|---|---|---|
| n_estimators | Number of trees in the forest. | More trees usually improve stability but increase training time. |
| max_depth | Maximum depth of each tree. | Controls overfitting and model complexity. |
| min_samples_leaf | Minimum samples in each leaf. | Higher values make predictions smoother and reduce overfitting. |
| max_features | Number of features considered at each split. | Adds randomness and reduces correlation between trees. |
| bootstrap | Whether trees are trained on bootstrapped samples. | Enables bagging and out-of-bag evaluation. |
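These names correspond to parameters of scikit-learn's `RandomForestClassifier`. The sketch below, which assumes scikit-learn and a synthetic dataset, sets each one and reads the out-of-bag score that bootstrapping makes possible. The values are placeholders to tune, not defaults to copy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=500, n_features=12, n_informative=5, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=8,           # cap on the depth of each tree
    min_samples_leaf=5,    # smoother, less overfit leaves
    max_features="sqrt",   # random subset of features tried at each split
    bootstrap=True,        # train each tree on a bootstrap sample
    oob_score=True,        # score each row using trees that did not see it
    random_state=0,
)
forest.fit(X, y)

print(f"Trees in the forest: {len(forest.estimators_)}")
print(f"Out-of-bag accuracy: {forest.oob_score_:.3f}")
```

The out-of-bag accuracy is a quick estimate of generalization that does not need a separate validation split, though cross-validation remains the more thorough check.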
Feature Importance in Tree-Based Models
Tree-based models can estimate feature importance by measuring how much each feature contributes to reducing prediction error or improving split quality across trees.
Feature importance is useful for interpretation, feature selection, and business insight. However, it should not be treated as perfect truth because importance scores can be affected by correlated features and variable types.
Important: Feature importance tells us which variables were useful to the model, but it does not automatically prove causation. Always interpret importance with business logic.
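As a sketch of two common ways to read importance, the code below compares scikit-learn's impurity-based `feature_importances_` with `permutation_importance` computed on held-out data. The data is synthetic, and the feature names are hypothetical labels attached to it: the first three columns are informative by construction and the last two are pure noise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical names; with shuffle=False the informative columns come first.
feature_names = ["tenure", "monthly_charges", "support_tickets", "noise_1", "noise_2"]
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importance: computed from the training process itself.
impurity_importance = forest.feature_importances_

# Permutation importance: drop in test score when one column is shuffled.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)

for name, imp, perm in zip(feature_names, impurity_importance, result.importances_mean):
    print(f"{name:>16}: impurity={imp:.3f}, permutation={perm:.3f}")
```

When the two rankings disagree, that is a signal to look more closely, for example at correlated features, before drawing business conclusions.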
Tree-Based Models and Feature Scaling
Decision Trees and Random Forests usually do not require feature scaling. This is because trees split data using thresholds and ordering, not distance or gradient-based coefficient optimization.
For example, whether income is measured in rupees or thousands of rupees does not change the order of observations, so the tree can find the same groups with a rescaled threshold.
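The scaling argument can be verified directly. The sketch below finds the best Gini split on a toy income column measured in rupees and again in thousands of rupees. The helper functions and the data are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, labels):
    """Return (threshold, left_row_ids) for the lowest weighted Gini split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between equal values
        left, right = order[:k], order[k:]
        cost = (
            len(left) * gini([labels[i] for i in left])
            + len(right) * gini([labels[i] for i in right])
        ) / len(order)
        if best is None or cost < best[0]:
            best = (cost, (lo + hi) / 2, sorted(left))
    return best[1], best[2]


income_rupees = [25000, 40000, 55000, 90000, 120000, 150000]
defaulted = ["yes", "yes", "yes", "no", "no", "no"]
income_thousands = [x / 1000 for x in income_rupees]

thr_rupees, left_rupees = best_split(income_rupees, defaulted)
thr_thousands, left_thousands = best_split(income_thousands, defaulted)

print(thr_rupees, left_rupees)        # 72500.0 [0, 1, 2]
print(thr_thousands, left_thousands)  # 72.5 [0, 1, 2]
```

Only the threshold changes units; the rows sent to each branch are identical, so the predictions are too.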
Handling Categorical Variables
Many tree implementations still require categorical variables to be encoded numerically before training. One-hot encoding, ordinal encoding, target encoding, or frequency encoding may be used depending on the variable type, cardinality, and model library.
| Categorical Situation | Possible Encoding | Note |
|---|---|---|
| Low-cardinality nominal variable | One-hot encoding. | Useful for variables such as payment method or contract type. |
| Ordinal variable | Ordinal encoding. | Use only when the order is meaningful. |
| High-cardinality variable | Frequency encoding or target encoding. | Target encoding must be done carefully to avoid leakage. |
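The three simpler encodings in the table can be sketched in plain Python. The column names, category values, and tier ordering below are invented for illustration. In a real project the category list and the frequencies would be computed on the training data only and then reused for new data.

```python
from collections import Counter

# Hypothetical categorical columns for six customers.
payment_method = ["card", "upi", "card", "cash", "upi", "card"]
plan_tier = ["basic", "premium", "standard", "basic", "premium", "standard"]
city = ["Delhi", "Mumbai", "Delhi", "Pune", "Delhi", "Mumbai"]

# One-hot encoding: one 0/1 column per category (low-cardinality nominal).
categories = sorted(set(payment_method))  # ['card', 'cash', 'upi']
one_hot = [[int(value == c) for c in categories] for value in payment_method]

# Ordinal encoding: integers that respect a meaningful order.
tier_order = {"basic": 0, "standard": 1, "premium": 2}
ordinal = [tier_order[value] for value in plan_tier]

# Frequency encoding: replace each category with its share of rows.
counts = Counter(city)
frequency = [counts[value] / len(city) for value in city]

print(categories, one_hot[0])  # ['card', 'cash', 'upi'] [1, 0, 0]
print(ordinal)                 # [0, 2, 1, 0, 2, 1]
print(frequency[0])            # 0.5
```

Target encoding is left out on purpose: it uses the target variable, so it needs out-of-fold computation to avoid the leakage mentioned in the table.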
Example: Customer Churn Prediction
Business Problem
A telecom company wants to predict whether customers will churn. The dataset contains customer tenure, monthly charges, payment method, support tickets, contract type, data usage, and churn status.
| Model | How It Helps | Business Interpretation |
|---|---|---|
| Decision Tree | Creates simple churn rules based on tenure, charges, and complaints. | Easy to explain as rule paths such as “new customer + high charges + many complaints = high churn risk”. |
| Random Forest | Combines many trees for stronger prediction. | May identify tenure, complaints, and payment delays as the most important churn drivers. |
Example: Loan Default Prediction
Classification Problem
A bank wants to predict whether a loan applicant may default. Tree-based models can capture complex non-linear relationships between income, loan amount, credit score, employment type, repayment history, and debt-to-income ratio.
- Decision Tree: Useful for explainable credit decision rules.
- Random Forest: Useful for better accuracy and stability.
- Feature Importance: Helps identify which risk factors matter most.
- Validation: Important to avoid overfitting and ensure fair generalization.
Example: House Price Prediction
Regression Problem
A real estate company wants to predict house prices. Decision trees can split houses by location, area, number of rooms, property age, and amenities. Random Forest can combine many such trees to produce a more stable prediction.
- Tree Advantage: Captures non-linear price jumps by location and property size.
- Forest Advantage: Reduces overfitting compared to one deep tree.
- No Scaling Needed: Area, age, distance, and price-related features do not usually need scaling for tree splits.
- Limitation: Random Forest may struggle to extrapolate prices beyond the range seen in training data.
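The extrapolation limit is easy to demonstrate. The sketch below assumes scikit-learn and NumPy and uses a made-up, perfectly linear price rule. Because every leaf predicts an average of training targets, the forest cannot predict above the highest price it was trained on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: areas from 500 to 2000, price = 0.1 * area.
area = np.arange(500, 2001, 50).reshape(-1, 1)
price = 0.1 * area.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(area, price)

# Ask for a house twice as large as anything seen in training.
prediction = forest.predict(np.array([[4000]]))[0]

print(f"Highest training price:   {price.max():.1f}")
print(f"Linear rule at area 4000: {0.1 * 4000:.1f}")
print(f"Forest prediction:        {prediction:.1f}")
```

The forest's answer stays at or below the highest training price, even though the underlying rule would give a much larger value.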
When to Use Decision Trees
Use a decision tree when:
- You need a simple and explainable model.
- Business users want rule-based interpretation.
- The dataset has non-linear patterns.
- You want a quick baseline model.
- The feature count is manageable.
Be cautious with a single decision tree when:
- The tree grows very deep and overfits.
- The model is unstable across data samples.
- Prediction accuracy matters more than simple explainability.
- The dataset is noisy and small.
When to Use Random Forest
Use a random forest when:
- You want stronger performance than a single tree.
- The dataset is tabular and has mixed feature types.
- There are non-linear relationships and interactions.
- You want feature importance estimates.
- You want a robust baseline for structured data.
Be cautious with a random forest when:
- You need a very simple explanation for every prediction.
- The dataset is extremely large and training time matters.
- You need smooth extrapolation beyond training values.
- Feature importance must be interpreted causally.
Common Mistakes with Tree-Based Models
| Mistake | Why It Is Harmful | Better Approach |
|---|---|---|
| Allowing a decision tree to grow too deep | The tree may memorize training data and overfit. | Limit max depth, use pruning, or increase minimum samples per leaf. |
| Assuming Random Forest is always interpretable | Many trees are harder to explain than one tree. | Use feature importance, partial dependence, and simpler surrogate rules when needed. |
| Ignoring class imbalance | The model may perform poorly on minority classes. | Use class weights, balanced sampling, or suitable metrics such as recall and F1. |
| Trusting feature importance blindly | Importance can be biased or affected by correlated variables. | Validate importance using business logic and additional methods. |
| Not tuning hyperparameters | Default settings may underfit or overfit. | Tune depth, number of trees, min samples, and max features using validation data. |
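Two of the fixes above, class weights and hyperparameter tuning, can be combined in one short sketch. It assumes scikit-learn and an imbalanced synthetic dataset (about 10% positives), with a deliberately small parameter grid so it runs quickly. A real search would usually cover more values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic data: roughly 90% class 0 and 10% class 1.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

search = GridSearchCV(
    estimator=RandomForestClassifier(
        n_estimators=50,
        class_weight="balanced",  # weight classes inversely to their frequency
        random_state=0,
    ),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 10]},
    scoring="f1",  # F1 on the minority class instead of plain accuracy
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

Scoring on F1 rather than accuracy keeps the search from rewarding a model that simply predicts the majority class.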
Best Practices for Tree-Based Models
Tree-Based Model Checklist
- Start with a simple decision tree: It helps understand rule-based patterns in the data.
- Control tree complexity: Tune max depth, min samples split, and min samples leaf.
- Use Random Forest for stronger performance: It reduces variance compared to a single tree.
- Evaluate using validation data: Do not judge performance only on training data.
- Check feature importance: Use it for insight, not automatic causation.
- Handle categorical variables properly: Encode categories based on cardinality and business meaning.
- Use suitable metrics: Regression and classification require different evaluation metrics.
- Watch for class imbalance: Accuracy alone may be misleading in classification problems.
- Compare with linear models: Tree-based models are powerful, but simpler models may be easier to explain.
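Several checklist items can be exercised together by cross-validating a single tree, a random forest, and a linear baseline on the same data. This is a minimal sketch that assumes scikit-learn and a synthetic dataset. The model settings are arbitrary, and the ranking it prints will not necessarily carry over to real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# 5-fold cross-validated accuracy for each model on identical folds.
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name:>20}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

If the simpler model scores within noise of the forest, its easier explanation may make it the better choice.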
Why Tree-Based Models are Important
Tree-based models are important because they work well on many real-world tabular datasets. They can capture non-linear patterns, interactions, thresholds, and segment-level behaviour without requiring heavy mathematical assumptions.
Decision Trees are useful for interpretability and rule-based explanation. Random Forests are useful for stronger predictive performance and stability. Together, they form a foundation for understanding more advanced ensemble models such as Gradient Boosting, XGBoost, LightGBM, and CatBoost.
Practical Insight: If linear regression is the classic baseline for regression, tree-based models are often the practical baseline for real-world tabular machine learning problems.
Key Takeaways
- Tree-based models make predictions using rule-based splits.
- Decision Trees are easy to interpret but can overfit if too deep.
- Random Forest combines many decision trees to improve stability and reduce overfitting.
- Classification trees predict categories; regression trees predict numerical values.
- Tree models can capture non-linear relationships and feature interactions.
- Tree-based models usually do not require feature scaling.
- Important hyperparameters include max depth, number of trees, min samples leaf, and max features.
- Feature importance helps interpretation but should not be treated as proof of causation.
- Random Forest is often a strong baseline for structured tabular predictive modelling.