Encoding Categorical Variables: One-Hot, Label, and Target Encoding

Many real-world datasets contain categorical variables such as city, gender, payment method, product category, education level, customer segment, and contract type. Machine learning models usually require numerical input, so these categories must be converted into numbers before modelling.

This conversion process is called categorical encoding. Choosing the right encoding method is important because poor encoding can confuse the model, create false order, increase dimensionality, or even cause data leakage.

What are Categorical Variables?

Categorical variables represent groups, labels, or categories instead of continuous numerical values. They describe qualitative properties of an observation.

For example, payment method may contain categories such as UPI, credit card, debit card, wallet, and cash. These values cannot be directly understood by most machine learning algorithms unless they are encoded numerically.

Core Idea: Encoding converts category labels into numerical representations so that machine learning models can use them as input features.

Types of Categorical Variables

Type | Meaning | Examples | Encoding Consideration
Nominal Categorical | Categories have no natural order. | City, gender, product category, payment method. | One-hot encoding is often suitable.
Ordinal Categorical | Categories have meaningful order. | Low/Medium/High, education level, satisfaction rating. | Ordinal or label encoding may be suitable if order is meaningful.
High-Cardinality Categorical | Variable has many unique categories. | PIN code, product ID, customer city, merchant ID. | Target, frequency, grouping, or embedding-style approaches may be needed.
Binary Categorical | Only two categories. | Yes/No, Active/Inactive, Male/Female. | Can be encoded as 0 and 1.

Why Encoding Matters in Predictive Modelling

  • Models Need Numbers: Most algorithms cannot directly process text categories such as “Delhi” or “Credit Card”.
  • Categories Carry Signals: Product type, location, contract type, and customer segment often strongly influence predictions.
  • Wrong Encoding Can Mislead: Assigning numbers to unordered categories can create false ranking and distort model learning.
  • Some Encodings Can Leak: Target encoding can leak target information if not performed carefully inside training folds.

Common Encoding Techniques

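[Figure: Visual Overview of Encoding Methods. Panels: One-Hot Encoding (City values Delhi, Mumbai, and Kolkata mapped to the binary rows 1 0 0, 0 1 0, and 0 0 1); Label / Ordinal Encoding (Low = 1, Medium = 2, High = 3); Target Encoding (categories replaced by values such as 0.18, 0.31, and 0.47).]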

1. One-Hot Encoding

One-hot encoding creates a separate binary column for each category. Each new column contains 1 if the observation belongs to that category and 0 otherwise.

This method is commonly used for nominal categorical variables where categories do not have natural order.

Original Payment Method | UPI | Card | Cash | Wallet
UPI | 1 | 0 | 0 | 0
Card | 0 | 1 | 0 | 0
Cash | 0 | 0 | 1 | 0
Wallet | 0 | 0 | 0 | 1

Use One-Hot Encoding When
  • Categories are nominal with no natural order.
  • The number of unique categories is small or moderate.
  • You are using linear models, logistic regression, SVM, or neural networks.
  • You want to avoid false ordering between categories.
Be Careful When
  • The variable has hundreds or thousands of categories.
  • The dataset becomes too wide after encoding.
  • Many categories are rare or appear only once.
  • You need memory-efficient modelling.
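
The one-hot idea can be sketched in a few lines of plain Python. This is only an illustration of the mechanics, not a production encoder: the function name one_hot_encode and the sample list are assumptions made for this example, and the column order simply follows the order in which categories first appear.

```python
def one_hot_encode(values):
    """Return (column_names, rows) with one binary column per category."""
    # dict.fromkeys keeps first-seen order and drops duplicates.
    categories = list(dict.fromkeys(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical sample column; the last row repeats an earlier category.
payment_methods = ["UPI", "Card", "Cash", "Wallet", "UPI"]
columns, encoded = one_hot_encode(payment_methods)

print(columns)
for original, row in zip(payment_methods, encoded):
    print(original, row)
```

Each row contains exactly one 1, so the number of columns grows with the number of distinct categories. That is why this approach becomes costly for high-cardinality variables.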

2. Label Encoding

Label encoding assigns a unique integer to each category. For example, Red = 0, Blue = 1, Green = 2. This creates one numerical column instead of many binary columns.

Label encoding is simple, but it can be risky for nominal variables because the model may incorrectly assume an order or distance between categories.

Original Category | Label Encoded Value | Interpretation Risk
Delhi | 0 | Model may wrongly treat Delhi as lower than Mumbai.
Mumbai | 1 | Number is only a code, not a real rank.
Kolkata | 2 | Model may wrongly assume Kolkata is greater than Delhi.

Important: Label encoding should not be used blindly for unordered categories in models that interpret numerical order, such as linear regression or logistic regression.
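
A small sketch can make the false-order risk concrete. The helper fit_label_codes below is a hypothetical name used only for this illustration. It assigns codes in first-seen order, and a second, alphabetical assignment is built for comparison. Both are equally valid label encodings, yet they disagree about which city is "larger".

```python
def fit_label_codes(values):
    """Assign an integer code to each distinct category, in first-seen order."""
    distinct = dict.fromkeys(values)  # preserves order, removes duplicates
    return {category: code for code, category in enumerate(distinct)}


cities = ["Delhi", "Mumbai", "Kolkata", "Delhi"]

codes = fit_label_codes(cities)
encoded = [codes[city] for city in cities]

# An alternative assignment that is just as legitimate: alphabetical order.
alphabetical = {city: code for code, city in enumerate(sorted(set(cities)))}

print(codes)         # first-seen codes
print(alphabetical)  # alphabetical codes
print(encoded)
```

Because the ordering changes with an arbitrary choice of assignment rule, any pattern a model learns from the magnitude of these codes reflects the coding scheme rather than the cities themselves.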

3. Ordinal Encoding

Ordinal encoding is similar to label encoding, but the assigned numbers follow a real, meaningful order. This is suitable when categories naturally represent levels or ranks.

Original Category | Ordinal Encoded Value | Why It Makes Sense
Low | 1 | Lowest risk or intensity level.
Medium | 2 | Middle level.
High | 3 | Highest risk or intensity level.

Ordinal encoding is useful for variables such as education level, satisfaction rating, risk category, product quality grade, and customer priority level.
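
As a minimal sketch, ordinal encoding can be written as an explicit mapping whose order is stated by the analyst rather than inferred from the data. The names LEVEL_ORDER, LEVEL_TO_RANK, and ordinal_encode are assumptions for this example. Raising an error on an undefined level is one possible design choice, not the only one.

```python
# The order is written down explicitly, from lowest to highest.
LEVEL_ORDER = ["Low", "Medium", "High"]
LEVEL_TO_RANK = {level: rank for rank, level in enumerate(LEVEL_ORDER, start=1)}


def ordinal_encode(values, mapping):
    """Map each level to its rank; fail loudly on levels outside the order."""
    unknown = sorted(set(values) - set(mapping))
    if unknown:
        raise ValueError(f"Levels missing from the defined order: {unknown}")
    return [mapping[value] for value in values]


risk_levels = ["Medium", "Low", "High", "Low"]
encoded = ordinal_encode(risk_levels, LEVEL_TO_RANK)
print(encoded)
```

Defining the order in one place keeps the encoding tied to business meaning and makes it easy to review.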

4. Target Encoding

Target encoding replaces each category with a value based on the average target value for that category. It is especially useful for high-cardinality categorical variables where one-hot encoding would create too many columns.

For example, in a churn prediction problem, each city can be replaced by the churn rate of customers from that city.

City | Number of Customers | Churn Rate | Target Encoded Value
Delhi | 1,000 | 18% | 0.18
Mumbai | 850 | 31% | 0.31
Kolkata | 300 | 47% | 0.47

High-Risk Warning: Target encoding can easily cause data leakage if target averages are calculated using validation or test data. It must be fitted only on training data, and preferably using cross-validation or smoothing.
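
The basic mechanics can be sketched as follows, assuming a binary churn target and toy data invented for this illustration. The function names and the choice to fall back to the global training mean for unseen categories are assumptions, not a prescribed method. The category means are computed from training rows only and then looked up for new rows.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Return (per-category mean target, global mean) from training rows only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def apply_target_means(categories, means, fallback):
    """Look up each category; unseen categories receive the fallback value."""
    return [means.get(category, fallback) for category in categories]


# Hypothetical training data: city and churn flag (1 = churned).
train_city = ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai", "Kolkata", "Kolkata"]
train_churn = [0, 0, 0, 1, 1, 0, 1, 1]

means, global_mean = fit_target_means(train_city, train_churn)

# New rows are encoded with the statistics learned from training data.
test_city = ["Delhi", "Kolkata", "Pune"]
print(apply_target_means(test_city, means, global_mean))
```

In this toy data Kolkata is encoded as 1.0 on the basis of only two rows, which shows why small categories tend to overfit without smoothing.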

Target Encoding and Leakage

Target encoding uses the target variable to create a feature. This makes it powerful but risky. If done incorrectly, the model may indirectly learn the answer from the target itself.

Approach | Safe or Risky? | Reason
Calculate target encoding before train-test split | Risky | Test data information leaks into training features.
Fit target encoding only on training data | Safer | Validation and test data remain unseen during encoding calculation.
Use out-of-fold target encoding | Best Practice | Each training row receives an encoded value calculated without using its own target.
Use smoothing for rare categories | Recommended | Reduces overfitting for categories with very few observations.
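
Out-of-fold encoding and smoothing can be combined in a short sketch. Everything below is illustrative: the smoothing rule (n × category mean + m × global mean) / (n + m) is one common form, the parameter names n_folds and smoothing are assumptions, and folds are assigned by row index modulo n_folds purely to keep the example deterministic. In practice folds would normally be shuffled.

```python
from collections import defaultdict


def smoothed_means(categories, targets, smoothing):
    """Per-category target mean shrunk toward the global mean.

    encoded = (n * category_mean + smoothing * global_mean) / (n + smoothing)
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {
        category: (sums[category] + smoothing * global_mean) / (counts[category] + smoothing)
        for category in sums
    }
    return means, global_mean


def out_of_fold_target_encode(categories, targets, n_folds=2, smoothing=2.0):
    """Encode each training row using statistics from the other folds only."""
    n_rows = len(categories)
    encoded = [None] * n_rows
    for fold in range(n_folds):
        holdout = [i for i in range(n_rows) if i % n_folds == fold]
        fit_rows = [i for i in range(n_rows) if i % n_folds != fold]
        means, global_mean = smoothed_means(
            [categories[i] for i in fit_rows],
            [targets[i] for i in fit_rows],
            smoothing,
        )
        for i in holdout:
            # A category absent from the fitting folds falls back to their global mean.
            encoded[i] = means.get(categories[i], global_mean)
    return encoded


# Hypothetical training data.
city = ["Delhi", "Delhi", "Delhi", "Delhi", "Kolkata", "Kolkata"]
churn = [0, 1, 0, 0, 1, 1]
encoded = out_of_fold_target_encode(city, churn)
print(encoded)
```

Because each row is encoded from folds that exclude it, changing a row's own target leaves that row's encoded value unchanged, which is the property that limits leakage.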

5. Frequency Encoding

Frequency encoding replaces each category with the number of times it appears in the training data, or with its percentage frequency. It is useful when category popularity itself carries predictive meaning.

For example, a commonly used product category may behave differently from a rare product category. Similarly, popular cities or high-volume merchants may have more stable behavioural patterns.

Product Category | Frequency | Frequency Encoded Value
Electronics | 12,000 | 12000
Books | 4,500 | 4500
Luxury Watches | 250 | 250
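
A minimal sketch of frequency encoding, using a small invented sample rather than the counts in the table above. The function name fit_frequencies is an assumption, and encoding an unseen category as 0 is one possible convention rather than a rule.

```python
from collections import Counter


def fit_frequencies(values):
    """Return (counts, proportions) per category from the given rows."""
    counts = Counter(values)
    total = len(values)
    proportions = {category: count / total for category, count in counts.items()}
    return counts, proportions


# Hypothetical training column with 10 rows.
train_category = ["Electronics"] * 6 + ["Books"] * 3 + ["Luxury Watches"]
counts, proportions = fit_frequencies(train_category)

# Categories not seen during fitting are encoded as 0 here.
new_rows = ["Books", "Luxury Watches", "Toys"]
count_encoded = [counts.get(category, 0) for category in new_rows]
proportion_encoded = [proportions.get(category, 0.0) for category in new_rows]

print(count_encoded)
print(proportion_encoded)
```

Proportions are often easier to reuse than raw counts because they do not depend on the size of the dataset.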

Comparing Encoding Methods

Encoding Method | Best For | Advantages | Limitations
One-Hot Encoding | Nominal variables with few categories. | No false ordering; easy to interpret. | Creates many columns for high-cardinality variables.
Label Encoding | Tree-based models or category IDs in some cases. | Simple and compact. | Can create false order in unordered categories.
Ordinal Encoding | Ordered categories. | Preserves meaningful order. | Assumes distances between levels are meaningful.
Target Encoding | High-cardinality categorical variables. | Compact and can capture target relationship. | High leakage and overfitting risk if done incorrectly.
Frequency Encoding | Variables where category popularity matters. | Compact and simple. | Different categories with same frequency become indistinguishable.

Choosing the Right Encoding Method

Encoding Decision Flow

Identify Category Type → Check Cardinality → Check Model Type → Assess Leakage Risk → Validate Performance

Use One-Hot Encoding When
  • Categories are unordered.
  • Unique category count is low or moderate.
  • You want interpretability.
  • You are using linear or distance-based models.
Use Ordinal Encoding When
  • Categories have real order.
  • Business meaning supports ranking.
  • Examples include low, medium, high.
  • The order should be defined carefully.
Use Target Encoding When
  • There are many unique categories.
  • Category has strong relationship with target.
  • You use cross-validation or out-of-fold encoding.
  • You apply smoothing for rare categories.
Use Frequency Encoding When
  • Category frequency itself may be predictive.
  • You need compact representation.
  • High-cardinality variable is difficult to one-hot encode.
  • You want a simple alternative to target encoding.

Example: Encoding for Customer Churn Prediction

Business Problem

A telecom company wants to predict customer churn. The dataset contains categorical variables such as contract type, payment method, city, customer segment, and satisfaction level.

Categorical Variable | Variable Type | Recommended Encoding | Reason
Contract Type | Nominal, low-cardinality. | One-hot encoding. | Monthly, yearly, and two-year contracts are treated as distinct categories rather than points on a ranked scale.
Payment Method | Nominal, low-cardinality. | One-hot encoding. | Payment method categories are unordered.
City | High-cardinality. | Target encoding or frequency encoding. | One-hot encoding may create too many columns.
Satisfaction Level | Ordinal. | Ordinal encoding. | Low, medium, and high satisfaction have meaningful order.
Customer Segment | Nominal or ordinal depending on definition. | One-hot or ordinal encoding. | Encoding depends on whether the segment has true ranking.

Example: Encoding for House Price Prediction

Regression Problem

A real estate company wants to predict house prices. The dataset contains location, property type, furnishing status, builder name, and property condition.

  • Location: High-cardinality; target encoding may be useful but must be done carefully.
  • Property Type: Apartment, villa, plot; one-hot encoding is usually suitable.
  • Furnishing Status: Unfurnished, semi-furnished, fully furnished; ordinal encoding may be suitable if business order is meaningful.
  • Builder Name: High-cardinality; frequency or target encoding may be considered.
  • Property Condition: Poor, average, good, excellent; ordinal encoding may be useful.

Handling Rare and Unknown Categories

Real-world data often contains rare categories and new categories that appear during prediction but were not present during training. These must be handled carefully.

Problem | Example | Recommended Treatment
Rare Categories | A city appears only 3 times in training data. | Group rare categories into “Other”.
Unknown Category in Test Data | A new payment method appears after deployment. | Use encoders that can handle unknown categories safely.
Too Many One-Hot Columns | Product ID has 50,000 unique values. | Use frequency encoding, target encoding, grouping, or advanced embeddings.
Inconsistent Category Labels | “Bangalore”, “Bengaluru”, and “BLR” used together. | Standardize categories before encoding.
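
Three of the treatments above (standardizing labels, grouping rare categories, and handling unknown categories) can be sketched together. The alias table CITY_ALIASES, the threshold min_count, and the function names are hypothetical and chosen only for this illustration. A real alias list would come from domain knowledge.

```python
from collections import Counter

# Hypothetical alias table mapping variant spellings to one canonical label.
CITY_ALIASES = {"bengaluru": "bangalore", "blr": "bangalore"}


def standardize(label):
    """Normalize case and spacing, then resolve known aliases."""
    cleaned = " ".join(label.strip().lower().split())
    return CITY_ALIASES.get(cleaned, cleaned)


def fit_known_categories(train_values, min_count):
    """Keep only categories seen at least min_count times in training data."""
    counts = Counter(standardize(value) for value in train_values)
    return {category for category, count in counts.items() if count >= min_count}


def apply_grouping(values, known, other="other"):
    """Map rare and never-seen categories to a single fallback bucket."""
    result = []
    for value in values:
        category = standardize(value)
        result.append(category if category in known else other)
    return result


train_city = ["Bangalore", "Bengaluru", " BLR ", "Delhi", "Delhi", "Shimla"]
known = fit_known_categories(train_city, min_count=2)

# "Shimla" is rare in training data and "Pune" was never seen.
print(apply_grouping(["bengaluru", "Shimla", "Pune", "DELHI"], known))
```

Sending both rare and unknown values to the same bucket keeps the encoder's output space fixed after training, so new categories in production do not break the model input.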

Encoding and Train-Test Split

Encoding should be fitted using only the training data and then applied to validation and test data. This avoids data leakage and makes evaluation more realistic.

Safe Workflow: Split the dataset first, fit the encoder on training data only, then transform validation and test data using the fitted encoder.

Safe Encoding Pipeline

Split Data → Fit Encoder on Training Set → Transform Validation Set → Transform Test Set → Train and Evaluate Model
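
The workflow can be sketched with a tiny fit/transform encoder written in plain Python. The class OneHotVocabulary, the toy rows, and the 75/25 split are assumptions for illustration. The point is only the order of operations: split first, fit on the training portion, then reuse the fitted encoder everywhere else.

```python
import random


class OneHotVocabulary:
    """Tiny fit/transform encoder whose vocabulary comes from fit() only."""

    def fit(self, values):
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # A category never seen during fit maps to an all-zero row.
        return [[1 if value == category else 0 for category in self.categories_]
                for value in values]


# Hypothetical (payment_method, churned) rows.
rows = [("UPI", 0), ("Card", 1), ("Cash", 0), ("UPI", 0),
        ("Wallet", 1), ("Card", 0), ("UPI", 1), ("Cash", 0)]

# 1. Split first, before any encoder sees the data.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
cut = int(0.75 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

# 2. Fit the encoder on the training portion only.
encoder = OneHotVocabulary().fit([method for method, _ in train])

# 3. Transform both portions with the same fitted encoder.
X_train = encoder.transform([method for method, _ in train])
X_test = encoder.transform([method for method, _ in test])
y_train = [churned for _, churned in train]
y_test = [churned for _, churned in test]

print(encoder.categories_)
print(len(X_train), len(X_test))
```

The test rows never influence the vocabulary, so evaluation reflects what the model would face on genuinely new data.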

Common Mistakes in Categorical Encoding

Mistake | Why It Is Harmful | Better Approach
Using label encoding for unordered categories | Creates false numerical order. | Use one-hot encoding for nominal variables when cardinality is manageable.
One-hot encoding very high-cardinality variables | Creates too many sparse columns. | Use grouping, frequency encoding, or target encoding.
Target encoding before data split | Leaks target information into training features. | Use training-only or out-of-fold target encoding.
Ignoring unknown categories | Model may fail when new categories appear in production. | Use an “Unknown” or “Other” strategy.
Not standardizing category names | Same category may be treated as multiple categories. | Clean and standardize labels before encoding.

Best Practices for Encoding Categorical Variables

Categorical Encoding Checklist

  • Identify variable type: Check whether the category is nominal, ordinal, binary, or high-cardinality.
  • Clean categories first: Standardize spelling, spacing, case, and labels before encoding.
  • Use one-hot encoding for unordered low-cardinality variables: Avoid false ordering.
  • Use ordinal encoding only when order is meaningful: Define the order using business logic.
  • Be careful with label encoding: It can mislead models that treat numbers as ordered values.
  • Use target encoding safely: Fit only on training data and use out-of-fold encoding when possible.
  • Handle rare categories: Group rare values into “Other” if needed.
  • Prepare for unknown categories: New categories may appear in validation, test, or production data.
  • Validate model performance: Compare encoding methods using validation data.

Why Encoding is a Strategic Modelling Decision

Encoding is not just a technical preprocessing step. It directly affects how the model understands categories. The same categorical variable may need different encoding depending on the number of categories, model type, business meaning, and leakage risk.

Good encoding preserves useful information, avoids false assumptions, controls dimensionality, and improves model reliability.

Practical Insight: The best encoding method is the one that represents category information correctly without adding leakage, unnecessary complexity, or false order.

Key Takeaways

  • Categorical variables must usually be converted into numerical format before modelling.
  • One-hot encoding is suitable for nominal variables with manageable category counts.
  • Label encoding assigns integer codes but may create false order for unordered categories.
  • Ordinal encoding should be used only when categories have a meaningful rank.
  • Target encoding can be useful for high-cardinality variables but has high leakage risk.
  • Frequency encoding replaces categories with their occurrence counts or proportions.
  • Rare and unknown categories should be handled carefully.
  • Encoders should be fitted on training data only and then applied to validation and test data.