Tree-Based Models: Decision Trees and Random Forest

Tree-based models are powerful machine learning algorithms that make predictions by splitting data into smaller and smaller groups. They are popular because they can capture non-linear relationships, handle interactions between features, and work well for both regression and classification problems.

The two most important tree-based models to understand first are Decision Trees and Random Forests. A decision tree is a single tree of rules, while a random forest combines many trees to create a stronger and more stable model.

What are Tree-Based Models?

Tree-based models use a series of decision rules to make predictions. Each rule splits the data based on a feature value. The model keeps splitting until it reaches final groups, called leaf nodes, where predictions are made.

For example, a customer churn tree may first ask whether customer tenure is less than 6 months. Then it may ask whether monthly charges are high. Then it may ask whether the customer has raised support tickets. Based on these answers, the model predicts churn risk.
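
The questions above can be written as nested if-then rules. The following is a minimal sketch, not a trained model: the function name, the thresholds (6 months, 1,000 in monthly charges, 3 tickets), and the risk labels are illustrative assumptions.

```python
def churn_risk(tenure_months, monthly_charges, support_tickets):
    """Hand-written stand-in for a learned churn tree (thresholds are hypothetical)."""
    if tenure_months < 6:                 # root node: is the customer new?
        if monthly_charges > 1000:        # internal node: are charges high?
            return "high"
        if support_tickets >= 3:          # internal node: many support tickets?
            return "medium"
        return "low"
    # Longer-tenure customers: only heavy ticket volume raises the risk.
    return "medium" if support_tickets >= 3 else "low"


print(churn_risk(tenure_months=3, monthly_charges=1500, support_tickets=0))   # high
print(churn_risk(tenure_months=24, monthly_charges=400, support_tickets=1))   # low
```

A real tree learns both the order of the questions and the thresholds from the training data.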

Core Idea: Tree-based models divide data into meaningful segments using decision rules, then make predictions based on the behaviour of observations inside each segment.

Decision Tree Intuition

A decision tree works like a flowchart. At every step, it asks a question about the data. Based on the answer, the observation moves to the left or right branch. Eventually, it reaches a leaf node where the prediction is made.

Visual Idea of Tree-Based Models

[Figure: a decision tree that splits on "Tenure < 6?" and then "Charges High?", ending in Low Risk, Medium Risk, and High Risk leaves, shown beside a Random Forest feature importance chart for Tenure, Charges, Tickets, and Region.]

Key Parts of a Decision Tree

Tree Component | Meaning | Example
Root Node | The first split in the tree. | Customer tenure < 6 months?
Internal Node | A decision point inside the tree. | Monthly charges > ₹1,000?
Branch | The path created by a decision. | Yes branch or No branch.
Leaf Node | The final node where prediction is made. | Predict high churn risk.
Depth | The number of levels in the tree. | A depth of 4 means four levels of decisions.
Split | A rule that divides data into groups. | Age < 30, income > ₹50,000, city = Delhi.

How Decision Trees Make Splits

A decision tree chooses the split that makes the resulting groups purer, and therefore more useful for prediction. In classification, purity means that a group mostly contains one class. In regression, the tree tries to reduce the variation of the target value within each group.

Problem Type | Common Split Criteria | Goal
Classification (classification tree) | Gini impurity, entropy, information gain. | Create groups that are mostly one class.
Regression (regression tree) | Mean squared error, mean absolute error. | Create groups with similar target values.
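
To make the classification criteria concrete, here is a small sketch of Gini impurity, entropy, and the impurity of a candidate split. The function names and the toy churn labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity=gini):
    """Impurity after a split, weighting each child by its share of the rows."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


parent = ["churn"] * 5 + ["stay"] * 5
left = ["churn"] * 4 + ["stay"] * 1      # e.g. tenure < 6 months
right = ["churn"] * 1 + ["stay"] * 4     # e.g. tenure >= 6 months

print(gini(parent))                       # 0.5
print(weighted_impurity(left, right))     # about 0.32
# Information gain is the drop in entropy from parent to children.
print(entropy(parent) - weighted_impurity(left, right, entropy))
```

The split lowers Gini impurity from 0.5 to about 0.32, so a tree would prefer it over any candidate split that leaves the groups more mixed.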

Decision Tree for Classification

In classification, a decision tree predicts a category or class. Examples include churn or no churn, fraud or not fraud, default or no default, and approved or rejected.

The tree creates splits that separate classes as clearly as possible. At the final leaf node, the predicted class is usually the majority class in that leaf.

Decision Tree for Regression

In regression, a decision tree predicts a continuous numerical value. Examples include house price, sales revenue, customer spend, demand, delivery time, or insurance claim amount.

The tree creates groups where the target values are similar. At the final leaf node, the prediction is usually the average target value of training observations in that leaf.
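
A single regression split can be sketched as a search for the threshold that minimises the squared error within the two resulting groups, with each leaf predicting its group mean. The helper names and the house data below are hypothetical.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Try midpoints between sorted x values; return (threshold, total SSE)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for (x_lo, _), (x_hi, _) in zip(pairs, pairs[1:]):
        if x_lo == x_hi:
            continue  # no threshold can separate identical x values
        threshold = (x_lo + x_hi) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data: house area (sq ft) and price (in lakhs).
area = [600, 700, 800, 1500, 1600, 1700]
price = [30.0, 32.0, 34.0, 80.0, 82.0, 84.0]

threshold, _ = best_split(area, price)
left_leaf = mean(p for a, p in zip(area, price) if a <= threshold)
right_leaf = mean(p for a, p in zip(area, price) if a > threshold)
print(threshold, left_leaf, right_leaf)   # 1150.0 32.0 82.0
```

A full regression tree repeats this search inside each group until a stopping rule such as a depth limit is reached.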

Advantages of Decision Trees

  • Easy to Interpret: A decision tree can be explained as a sequence of simple if-then rules.
  • Handles Non-Linearity: Trees can capture non-linear relationships without requiring explicit transformations.
  • Captures Interactions: Trees naturally capture interactions between variables through sequential splits.
  • Little Scaling Required: Decision trees generally do not require feature scaling because they split by thresholds.

Limitations of Decision Trees

Decision trees are easy to understand, but they can overfit. A deep tree may memorize training data instead of learning general patterns. This can lead to excellent training performance but poor test performance.

Common Weaknesses
  • Can overfit if allowed to grow too deep.
  • Small data changes can create different trees.
  • Single trees may be unstable.
  • Predictions may be less smooth in regression problems.
How to Control Overfitting
  • Limit maximum depth.
  • Set minimum samples per leaf.
  • Set minimum samples required to split.
  • Use pruning.
  • Use Random Forest instead of a single tree.

Important Decision Tree Hyperparameters

Hyperparameter | Meaning | Effect
max_depth | Maximum depth of the tree. | Lower depth reduces overfitting but may underfit if too low.
min_samples_split | Minimum observations needed to split a node. | Higher values make the tree simpler.
min_samples_leaf | Minimum observations required in a leaf node. | Prevents tiny leaves that memorize training data.
max_features | Maximum number of features considered at each split. | Useful for randomness in ensemble methods.
criterion | Split quality measure. | Examples include Gini, entropy, and squared error.
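
The hyperparameter names in the table match those of scikit-learn's `DecisionTreeClassifier`, so the sketch below assumes that library is installed. It compares an unconstrained tree with a constrained one on synthetic data; the specific values (depth 4, 20 and 10 samples) are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data stands in for a real dataset.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
constrained = DecisionTreeClassifier(
    max_depth=4,            # cap the number of decision levels
    min_samples_split=20,   # need at least 20 rows to split a node
    min_samples_leaf=10,    # every leaf keeps at least 10 rows
    criterion="gini",
    random_state=0,
).fit(X_train, y_train)

# Print depth, leaf count, training accuracy, and test accuracy for each tree.
for name, tree in [("deep", deep), ("constrained", constrained)]:
    print(name, tree.get_depth(), tree.get_n_leaves(),
          round(tree.score(X_train, y_train), 3),
          round(tree.score(X_test, y_test), 3))
```

The unconstrained tree fits the training rows perfectly, which is the memorization pattern described earlier. How the two trees compare on the test rows depends on the data, so check the printed scores instead of assuming the outcome.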

What is Random Forest?

Random Forest is an ensemble model that combines many decision trees. Instead of relying on one tree, it builds multiple trees on different samples of the data and combines their predictions.

For classification, a random forest usually predicts by majority vote. For regression, it predicts by averaging the predictions of all trees.
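
The two combination rules can be sketched in a few lines. The function names and the per-tree outputs are made up for illustration.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Majority vote: the class predicted by the most trees wins."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Regression forests average the per-tree numeric predictions."""
    return mean(tree_predictions)


print(forest_classify(["churn", "stay", "churn", "churn", "stay"]))  # churn
print(forest_regress([41.0, 39.0, 43.0, 40.0, 42.0]))                # 41.0
```

Some libraries, scikit-learn among them, average the per-tree class probabilities instead of counting hard votes. The two approaches usually agree, but not always.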

Core Idea: A random forest reduces the instability of a single decision tree by averaging the decisions of many different trees.

How Random Forest Works

Random Forest uses two important ideas: bagging and random feature selection. Bagging means each tree is trained on a bootstrap sample, that is, a random sample of the training data drawn with replacement. Random feature selection means each split considers only a random subset of features.

Random Forest Training Process

Draw Random Samples → Train Many Trees → Use Random Feature Subsets → Combine Predictions → Final Output

Bagging Explained

Bagging stands for bootstrap aggregating. It means drawing multiple random samples from the training data with replacement, training a separate model on each sample, and aggregating their predictions.

Because each tree sees a slightly different version of the training data, the trees become different from each other. Combining them reduces variance and makes the final model more stable.
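
A bootstrap sample can be sketched with the standard library alone. The helper name, the seed, and the row count are assumptions made for the example.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(42)   # fixed seed so the sketch is repeatable
n_rows = 1000

for tree_id in range(3):
    sample = bootstrap_indices(n_rows, rng)
    in_bag = set(sample)
    out_of_bag = n_rows - len(in_bag)
    # Roughly 63% of rows appear in each sample; the rest are "out of bag".
    print(tree_id, len(in_bag) / n_rows, out_of_bag / n_rows)
```

Each tree sees some rows more than once and misses others entirely. The missed rows are the out-of-bag rows, which can be used to evaluate that tree.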

Decision Tree vs Random Forest

Aspect | Decision Tree | Random Forest
Model Structure | Single tree. | Many trees combined.
Interpretability | High, especially if tree is small. | Lower than single tree, but feature importance helps.
Overfitting Risk | High if tree is deep. | Lower because predictions are averaged across many trees.
Performance | Can be good but unstable. | Usually stronger and more stable.
Training Time | Fast. | Slower because many trees are trained.
Best Use | Explainable rule-based modelling and simple baselines. | Stronger predictive performance on tabular data.

Advantages of Random Forest

  • Strong Predictive Power: Random Forest often performs well on structured tabular datasets.
  • Reduces Overfitting: Averaging many trees reduces the instability of a single decision tree.
  • Feature Importance: Random Forest can estimate which features are most useful for prediction.
  • Minimal Scaling Needs: Like decision trees, random forests usually do not require feature scaling.

Limitations of Random Forest

Main Limitations
  • Less interpretable than a single decision tree.
  • Can be slower with many trees and large datasets.
  • Impurity-based feature importance can be biased toward continuous and high-cardinality variables.
  • May not extrapolate well beyond the range of training data in regression.
How to Manage Them
  • Use feature importance and partial dependence plots.
  • Tune number of trees and tree depth.
  • Use cross-validation for reliable evaluation.
  • Compare with simpler models when interpretability matters.

Important Random Forest Hyperparameters

Hyperparameter | Meaning | Practical Impact
n_estimators | Number of trees in the forest. | More trees usually improve stability but increase training time.
max_depth | Maximum depth of each tree. | Controls overfitting and model complexity.
min_samples_leaf | Minimum samples in each leaf. | Higher values make predictions smoother and reduce overfitting.
max_features | Number of features considered at each split. | Adds randomness and reduces correlation between trees.
bootstrap | Whether trees are trained on bootstrapped samples. | Enables bagging and out-of-bag evaluation.
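
These names correspond to parameters of scikit-learn's `RandomForestClassifier`, so the following sketch assumes that library. The data is synthetic and the parameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_depth=None,         # let each tree grow fully
    min_samples_leaf=2,     # slightly smoother leaves
    max_features="sqrt",    # random feature subset at each split
    bootstrap=True,         # bagging; required for out-of-bag scoring
    oob_score=True,         # score each row using trees that did not see it
    random_state=0,
).fit(X, y)

print(len(forest.estimators_))        # 200
print(round(forest.oob_score_, 3))    # accuracy estimated on out-of-bag rows
```

The out-of-bag score gives a quick estimate of generalization without a separate validation split. Cross-validation is still the more thorough check.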

Feature Importance in Tree-Based Models

Tree-based models can estimate feature importance by measuring how much each feature contributes to reducing prediction error or improving split quality across trees.

Feature importance is useful for interpretation, feature selection, and business insight. However, it should not be treated as perfect truth because importance scores can be affected by correlated features and variable types.

Important: Feature importance tells us which variables were useful to the model, but it does not automatically prove causation. Always interpret importance with business logic.
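
One way to cross-check importance scores is to compare impurity-based importance with permutation importance on held-out data. The sketch below assumes scikit-learn. The data is synthetic, and the column names are hypothetical labels attached to it: the first three columns are informative by construction and the last three are noise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2.
X, y = make_classification(n_samples=800, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
names = ["tenure", "charges", "tickets", "noise_a", "noise_b", "noise_c"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importance: computed from the training splits, sums to 1.
impurity = dict(zip(names, forest.feature_importances_))

# Permutation importance: drop in held-out score when a column is shuffled.
perm = permutation_importance(forest, X_test, y_test,
                              n_repeats=10, random_state=0)

for name, mean_drop in zip(names, perm.importances_mean):
    print(f"{name:8s} impurity={impurity[name]:.3f} permutation={mean_drop:.3f}")
```

When the two rankings disagree sharply, that is a signal to investigate correlated features or variable types before drawing conclusions.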

Tree-Based Models and Feature Scaling

Decision Trees and Random Forests usually do not require feature scaling. This is because trees split data using thresholds and ordering, not distance or gradient-based coefficient optimization.

For example, measuring income in rupees or in thousands of rupees does not change the order of observations, so the tree finds equivalent splits; only the threshold values are rescaled.
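
The income example can be checked directly. This sketch uses a hypothetical helper that finds the best Gini split on one feature, and shows that rescaling the feature leaves the chosen partition unchanged. The income and default values are invented.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_partition(xs, ys):
    """Return the set of row indices sent left by the best Gini threshold."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = None, None
    for cut in range(1, len(order)):
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue  # identical values cannot be separated
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right]))
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


income_rupees = [20000, 35000, 48000, 52000, 75000, 90000]
defaulted = [1, 1, 1, 0, 0, 0]
income_thousands = [x / 1000 for x in income_rupees]

same = (best_partition(income_rupees, defaulted)
        == best_partition(income_thousands, defaulted))
print(same)   # True
```

The split search depends only on the sorted order of the feature, which is why any order-preserving rescaling gives the same groups.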

Handling Categorical Variables

Many tree implementations still require categorical variables to be encoded numerically before training. One-hot encoding, ordinal encoding, target encoding, or frequency encoding may be used depending on the variable type, cardinality, and model library.

Categorical Situation | Possible Encoding | Note
Low-cardinality nominal variable | One-hot encoding. | Useful for variables such as payment method or contract type.
Ordinal variable | Ordinal encoding. | Use only when the order is meaningful.
High-cardinality variable | Frequency encoding or target encoding. | Target encoding must be done carefully to avoid leakage.
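
The three simpler encodings can be sketched without any library. The category values and the contract ordering below are assumptions made for illustration.

```python
from collections import Counter

payment_method = ["card", "upi", "cash", "upi", "card", "upi"]

# One-hot encoding: one 0/1 column per category (fine for low cardinality).
categories = sorted(set(payment_method))
one_hot = [[int(value == c) for c in categories] for value in payment_method]

# Ordinal encoding: only when the categories have a meaningful order.
contract_order = {"monthly": 0, "one_year": 1, "two_year": 2}
contract = ["monthly", "two_year", "one_year"]
ordinal = [contract_order[c] for c in contract]

# Frequency encoding: replace each category with its share of the rows.
counts = Counter(payment_method)
frequency = [counts[v] / len(payment_method) for v in payment_method]

print(categories)    # ['card', 'cash', 'upi']
print(one_hot[0])    # [1, 0, 0]
print(ordinal)       # [0, 2, 1]
print(frequency[1])  # 0.5
```

In a real pipeline the category list and the frequencies would be computed on the training split only and then applied to validation and test data.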

Example: Customer Churn Prediction

Business Problem

A telecom company wants to predict whether customers will churn. The dataset contains customer tenure, monthly charges, payment method, support tickets, contract type, data usage, and churn status.

Model | How It Helps | Business Interpretation
Decision Tree | Creates simple churn rules based on tenure, charges, and complaints. | Easy to explain as rule paths such as “new customer + high charges + many complaints = high churn risk”.
Random Forest | Combines many trees for stronger prediction. | May identify tenure, complaints, and payment delays as the most important churn drivers.

Example: Loan Default Prediction

Classification Problem

A bank wants to predict whether a loan applicant may default. Tree-based models can capture complex non-linear relationships between income, loan amount, credit score, employment type, repayment history, and debt-to-income ratio.

  • Decision Tree: Useful for explainable credit decision rules.
  • Random Forest: Useful for better accuracy and stability.
  • Feature Importance: Helps identify which risk factors matter most.
  • Validation: Important to avoid overfitting and ensure fair generalization.

Example: House Price Prediction

Regression Problem

A real estate company wants to predict house prices. Decision trees can split houses by location, area, number of rooms, property age, and amenities. Random Forest can combine many such trees to produce a more stable prediction.

  • Tree Advantage: Captures non-linear price jumps by location and property size.
  • Forest Advantage: Reduces overfitting compared to one deep tree.
  • No Scaling Needed: Area, age, distance, and price-related features do not usually need scaling for tree splits.
  • Limitation: Random Forest may struggle to extrapolate prices beyond the range seen in training data.
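
The extrapolation limitation is easy to demonstrate. The sketch below assumes scikit-learn and NumPy, and uses an invented linear relationship between area and price so that the true value outside the training range is known.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical linear relationship: price = 5 * area, areas from 50 to 149.
area = np.arange(50, 150).reshape(-1, 1)
price = 5.0 * area.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(area, price)

inside = forest.predict([[100]])[0]    # within the training range
outside = forest.predict([[300]])[0]   # far beyond the training range

print(round(inside, 1))                # close to 500
print(outside <= price.max())          # True: capped by the largest training target
```

Under this relationship the true price at an area of 300 would be 1,500, but the forest cannot predict above 745, the largest target it saw, because every leaf returns an average of training values.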

When to Use Decision Trees

Use Decision Trees When
  • You need a simple and explainable model.
  • Business users want rule-based interpretation.
  • The dataset has non-linear patterns.
  • You want a quick baseline model.
  • The feature count is manageable.
Avoid or Control When
  • The tree grows very deep and overfits.
  • The model is unstable across data samples.
  • Prediction accuracy matters more than simple explainability.
  • The dataset is noisy and small.

When to Use Random Forest

Use Random Forest When
  • You want stronger performance than a single tree.
  • The dataset is tabular and has mixed feature types.
  • There are non-linear relationships and interactions.
  • You want feature importance estimates.
  • You want a robust baseline for structured data.
Be Careful When
  • You need a very simple explanation for every prediction.
  • The dataset is extremely large and training time matters.
  • You need smooth extrapolation beyond training values.
  • Feature importance must be interpreted causally.

Common Mistakes with Tree-Based Models

Mistake | Why It Is Harmful | Better Approach
Allowing a decision tree to grow too deep | The tree may memorize training data and overfit. | Limit max depth, use pruning, or increase minimum samples per leaf.
Assuming Random Forest is always interpretable | Many trees are harder to explain than one tree. | Use feature importance, partial dependence, and simpler surrogate rules when needed.
Ignoring class imbalance | The model may perform poorly on minority classes. | Use class weights, balanced sampling, or suitable metrics such as recall and F1.
Trusting feature importance blindly | Importance can be biased or affected by correlated variables. | Validate importance using business logic and additional methods.
Not tuning hyperparameters | Default settings may underfit or overfit. | Tune depth, number of trees, min samples, and max features using validation data.
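
The class imbalance row is worth a concrete illustration. In this sketch the labels are invented (95 negatives, 5 positives) and the helper name is hypothetical. It shows how a model that always predicts the majority class can look accurate while finding no positive cases.

```python
def recall_precision_f1(y_true, y_pred, positive=1):
    """Compute recall, precision, and F1 for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1


# Hypothetical imbalanced data: 95 non-fraud rows, 5 fraud rows.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100            # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                               # 0.95
print(recall_precision_f1(y_true, y_pred))    # (0.0, 0.0, 0.0)
```

Accuracy of 95% hides the fact that every fraud case is missed, which is why recall and F1 are the more informative metrics here.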

Best Practices for Tree-Based Models

Tree-Based Model Checklist

  • Start with a simple decision tree: It helps understand rule-based patterns in the data.
  • Control tree complexity: Tune max depth, min samples split, and min samples leaf.
  • Use Random Forest for stronger performance: It reduces variance compared to a single tree.
  • Evaluate using validation data: Do not judge performance only on training data.
  • Check feature importance: Use it for insight, not automatic causation.
  • Handle categorical variables properly: Encode categories based on cardinality and business meaning.
  • Use suitable metrics: Regression and classification require different evaluation metrics.
  • Watch for class imbalance: Accuracy alone may be misleading in classification problems.
  • Compare with linear models: Tree-based models are powerful, but simpler models may be easier to explain.

Why Tree-Based Models are Important

Tree-based models are important because they work well on many real-world tabular datasets. They can capture non-linear patterns, interactions, thresholds, and segment-level behaviour without requiring heavy mathematical assumptions.

Decision Trees are useful for interpretability and rule-based explanation. Random Forests are useful for stronger predictive performance and stability. Together, they form a foundation for understanding more advanced ensemble models such as Gradient Boosting, XGBoost, LightGBM, and CatBoost.

Practical Insight: If linear regression is the classic baseline for regression, tree-based models are often the practical baseline for real-world tabular machine learning problems.

Key Takeaways

  • Tree-based models make predictions using rule-based splits.
  • Decision Trees are easy to interpret but can overfit if too deep.
  • Random Forest combines many decision trees to improve stability and reduce overfitting.
  • Classification trees predict categories; regression trees predict numerical values.
  • Tree models can capture non-linear relationships and feature interactions.
  • Tree-based models usually do not require feature scaling.
  • Important hyperparameters include max depth, number of trees, min samples leaf, and max features.
  • Feature importance helps interpretation but should not be treated as proof of causation.
  • Random Forest is often a strong baseline for structured tabular predictive modelling.