Encoding Categorical Variables: One-Hot, Label, and Target Encoding

Many real-world datasets contain categorical variables such as city, gender, payment method, product category, education level, customer segment, and contract type. Machine learning models usually require numerical input, so these categories must be converted into numbers before modelling.

This conversion process is called categorical encoding. Choosing the right encoding method is important because poor encoding can confuse the model, create false order, increase dimensionality, or even cause data leakage.

What are Categorical Variables?

Categorical variables represent groups, labels, or categories instead of continuous numerical values. They describe qualitative properties of an observation.

For example, payment method may contain categories such as UPI, credit card, debit card, wallet, and cash. These values cannot be directly understood by most machine learning algorithms unless they are encoded numerically.

Core Idea: Encoding converts category labels into numerical representations so that machine learning models can use them as input features.

Types of Categorical Variables

Type	Meaning	Examples	Encoding Consideration
Nominal Categorical	Categories have no natural order.	City, gender, product category, payment method.	One-hot encoding is often suitable.
Ordinal Categorical	Categories have meaningful order.	Low/Medium/High, education level, satisfaction rating.	Ordinal encoding with an explicitly defined order is usually suitable.
High-Cardinality Categorical	Variable has many unique categories.	PIN code, product ID, customer city, merchant ID.	Target, frequency, grouping, or embedding-style approaches may be needed.
Binary Categorical	Only two categories.	Yes/No, Active/Inactive, Male/Female.	Can be encoded as 0 and 1.

Why Encoding Matters in Predictive Modelling

Models Need Numbers: Most algorithms cannot directly process text categories such as “Delhi” or “Credit Card”.
Categories Carry Signals: Product type, location, contract type, and customer segment often strongly influence predictions.
Wrong Encoding Can Mislead: Assigning numbers to unordered categories can create false ranking and distort model learning.
Some Encodings Can Leak: Target encoding can leak target information if not performed carefully inside training folds.

Common Encoding Techniques

Visual Overview of Encoding Methods

[Figure: three panels. One-Hot Encoding: the City values Delhi, Mumbai, and Kolkata shown alongside Delhi, Mumbai, and Kolkata columns. Label / Ordinal Encoding: Low, Medium, High. Target Encoding: 0.18, 0.31, 0.47.]

1. One-Hot Encoding

One-hot encoding creates a separate binary column for each category. Each new column contains 1 if the observation belongs to that category and 0 otherwise.

This method is commonly used for nominal categorical variables where categories do not have natural order.

Original Payment Method	UPI	Card	Cash	Wallet
UPI	1	0	0	0
Card	0	1	0	0
Cash	0	0	1	0
Wallet	0	0	0	1
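
The table above can be reproduced with a minimal sketch in plain Python. The helper name `one_hot_encode` is hypothetical, and the choice to encode an unseen category as an all-zero row is an assumption of this sketch, not the only reasonable behaviour.

```python
def one_hot_encode(values, categories):
    """Return one list of 0/1 indicators per value, ordered like `categories`."""
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]


# Column order copied from the table above.
categories = ["UPI", "Card", "Cash", "Wallet"]
payments = ["UPI", "Card", "Cash", "Wallet"]

encoded = one_hot_encode(payments, categories)
for payment, row in zip(payments, encoded):
    print(payment, row)

# A category that is not in `categories` becomes an all-zero row in this sketch.
unseen = one_hot_encode(["Cheque"], categories)
```

Each encoded row has exactly one 1, so the number of columns grows with the number of categories. This is why the approach becomes expensive for high-cardinality variables.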

Use One-Hot Encoding When

Categories are nominal with no natural order.
The number of unique categories is small or moderate.
You are using linear models, logistic regression, SVM, or neural networks.
You want to avoid false ordering between categories.

Be Careful When

The variable has hundreds or thousands of categories.
The dataset becomes too wide after encoding.
Many categories are rare or appear only once.
You need memory-efficient modelling.

2. Label Encoding

Label encoding assigns a unique integer to each category. For example, Red = 0, Blue = 1, Green = 2. This creates one numerical column instead of many binary columns.

Label encoding is simple, but it can be risky for nominal variables because the model may incorrectly assume an order or distance between categories.

Original Category	Label Encoded Value	Interpretation Risk
Delhi	0	Model may wrongly treat Delhi as lower than Mumbai.
Mumbai	1	Number is only a code, not a real rank.
Kolkata	2	Model may wrongly assume Kolkata is greater than Delhi.

Important: Label encoding should not be used blindly for unordered categories in models that interpret numerical order, such as linear regression or logistic regression.
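
A small sketch makes the risk concrete. It assumes codes are assigned in order of first appearance, which reproduces the Delhi/Mumbai/Kolkata table above; the helper name `fit_label_encoder` is hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


cities = ["Delhi", "Mumbai", "Kolkata", "Delhi"]
mapping = fit_label_encoder(cities)
codes = [mapping[city] for city in cities]
print(mapping)
print(codes)

# The codes are only identifiers, yet arithmetic on them "works":
# the midpoint of Delhi and Kolkata equals the code for Mumbai.
midpoint = (mapping["Delhi"] + mapping["Kolkata"]) / 2
print(midpoint == mapping["Mumbai"])
```

A model that treats the column as a number sees Mumbai as lying halfway between Delhi and Kolkata, a relationship that exists only because of the order in which the codes were handed out.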

3. Ordinal Encoding

Ordinal encoding is similar to label encoding, but the assigned numbers follow a real meaningful order. This is suitable when categories naturally represent levels or ranks.

Original Category	Ordinal Encoded Value	Why It Makes Sense
Low	1	Lowest risk or intensity level.
Medium	2	Middle level.
High	3	Highest risk or intensity level.

Ordinal encoding is useful for variables such as education level, satisfaction rating, risk category, product quality grade, and customer priority level.
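
As a minimal sketch, the order can be written out explicitly rather than inferred from the labels. The names `LEVEL_ORDER` and `ordinal_encode` are hypothetical, and raising an error on an unexpected level is an assumption of this example.

```python
# Order is written out explicitly, lowest to highest.
LEVEL_ORDER = ["Low", "Medium", "High"]
ordinal_map = {level: rank for rank, level in enumerate(LEVEL_ORDER, start=1)}


def ordinal_encode(values, mapping):
    """Replace each level with its rank; an unexpected level raises KeyError."""
    return [mapping[value] for value in values]


ratings = ["Medium", "Low", "High", "Low"]
encoded = ordinal_encode(ratings, ordinal_map)
print(encoded)

# Pitfall: sorting the labels alphabetically yields High < Low < Medium.
alphabetical_map = {
    level: rank for rank, level in enumerate(sorted(LEVEL_ORDER), start=1)
}
print(alphabetical_map)
```

The second mapping shows why the order should come from business logic: an encoder that sorts labels alphabetically would rank High below Low.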

4. Target Encoding

Target encoding replaces each category with a value based on the average target value for that category. It is especially useful for high-cardinality categorical variables where one-hot encoding would create too many columns.

For example, in a churn prediction problem, each city can be replaced by the churn rate of customers from that city.

City	Number of Customers	Churn Rate	Target Encoded Value
Delhi	1,000	18%	0.18
Mumbai	850	31%	0.31
Kolkata	300	47%	0.47

High-Risk Warning: Target encoding can easily cause data leakage if target averages are calculated using validation or test data. It must be fitted only on training data, preferably with out-of-fold (cross-validated) estimates and with smoothing for rare categories.
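
The following sketch fits a smoothed target encoding on training rows only. It assumes one common smoothing form, (category sum + m × global mean) / (category count + m), where the weight m is the hypothetical `smoothing` parameter; the function names and the toy churn data are also illustrative, not taken from a specific library.

```python
from collections import defaultdict


def fit_target_encoder(categories, targets, smoothing=2.0):
    """Learn smoothed per-category target means from training rows only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }
    return encoding, global_mean


def apply_target_encoder(categories, encoding, global_mean):
    """Encode rows; categories unseen during fitting fall back to the global mean."""
    return [encoding.get(category, global_mean) for category in categories]


# Toy training data: 1 = churned, 0 = stayed.
train_cities = ["Delhi"] * 4 + ["Mumbai"] * 4 + ["Kolkata"] * 2
train_churn = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

encoding, global_mean = fit_target_encoder(train_cities, train_churn)
test_encoded = apply_target_encoder(["Kolkata", "Pune"], encoding, global_mean)
print(encoding)
print(test_encoded)
```

In this toy data Kolkata has only two rows, both churned, so its raw rate is 1.0. Smoothing pulls it toward the global mean of 0.5, and the unseen city Pune receives the global mean itself.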

Target Encoding and Leakage

Target encoding uses the target variable to create a feature. This makes it powerful but risky. If done incorrectly, the model may indirectly learn the answer from the target itself.

Approach	Safe or Risky?	Reason
Calculate target encoding before train-test split	Risky	Test data information leaks into training features.
Fit target encoding only on training data	Safer	Validation and test data remain unseen during encoding calculation.
Use out-of-fold target encoding	Best Practice	Each training row receives an encoded value calculated without using its own target.
Use smoothing for rare categories	Recommended	Reduces overfitting for categories with very few observations.
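
Out-of-fold encoding can be sketched in plain Python as follows. This is a simplified illustration: folds are assigned by row index modulo `n_folds` (which assumes the rows are already shuffled), no smoothing is applied, and the function name is hypothetical.

```python
def out_of_fold_target_encode(categories, targets, n_folds=5):
    """Encode each training row using target means from the other folds only."""
    n = len(categories)
    if not 2 <= n_folds <= n:
        raise ValueError("n_folds must be between 2 and the number of rows")
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Rows outside the current fold provide the statistics.
        fit_rows = [i for i in range(n) if i % n_folds != fold]
        global_mean = sum(targets[i] for i in fit_rows) / len(fit_rows)
        sums, counts = {}, {}
        for i in fit_rows:
            sums[categories[i]] = sums.get(categories[i], 0.0) + targets[i]
            counts[categories[i]] = counts.get(categories[i], 0) + 1
        # Rows inside the current fold receive values computed without them.
        for i in range(fold, n, n_folds):
            category = categories[i]
            if category in counts:
                encoded[i] = sums[category] / counts[category]
            else:
                encoded[i] = global_mean
    return encoded


cities = ["Delhi", "Delhi", "Mumbai", "Mumbai", "Delhi", "Mumbai"]
churn = [1, 0, 1, 1, 0, 0]
oof_values = out_of_fold_target_encode(cities, churn, n_folds=2)
print(oof_values)
```

Because a row's own target never contributes to its encoded value, a category that appears only once cannot simply memorise its label; in this sketch it falls back to the mean of the other folds.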

5. Frequency Encoding

Frequency encoding replaces each category with the number of times it appears in the training data, or with its proportion of all training rows. It is useful when category popularity itself carries predictive meaning.

For example, a commonly used product category may behave differently from a rare product category. Similarly, popular cities or high-volume merchants may have more stable behavioural patterns.

Product Category	Frequency	Frequency Encoded Value
Electronics	12,000	12000
Books	4,500	4500
Luxury Watches	250	250
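
The counts in the table can be reproduced with the standard library's `collections.Counter`. This is a minimal sketch: the training list is synthetic, built only to match the table's counts, and mapping unseen categories to 0 is an assumption.

```python
from collections import Counter

# Synthetic training column built to match the counts in the table above.
train_categories = (
    ["Electronics"] * 12000 + ["Books"] * 4500 + ["Luxury Watches"] * 250
)

counts = Counter(train_categories)
total = len(train_categories)
proportions = {category: count / total for category, count in counts.items()}

# Encode new rows with the counts learned from training data.
# Counter returns 0 for a category it has never seen.
new_rows = ["Books", "Toys", "Electronics"]
count_encoded = [counts[category] for category in new_rows]
print(count_encoded)
```

Using proportions instead of raw counts keeps the feature on a comparable scale when the training set size changes.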

Comparing Encoding Methods

Encoding Method	Best For	Advantages	Limitations
One-Hot Encoding	Nominal variables with few categories.	No false ordering; easy to interpret.	Creates many columns for high-cardinality variables.
Label Encoding	Tree-based models or category IDs in some cases.	Simple and compact.	Can create false order in unordered categories.
Ordinal Encoding	Ordered categories.	Preserves meaningful order.	Assumes distances between levels are meaningful.
Target Encoding	High-cardinality categorical variables.	Compact and can capture target relationship.	High leakage and overfitting risk if done incorrectly.
Frequency Encoding	Variables where category popularity matters.	Compact and simple.	Different categories with same frequency become indistinguishable.

Choosing the Right Encoding Method

Encoding Decision Flow

Identify Category Type → Check Cardinality → Check Model Type → Assess Leakage Risk → Validate Performance

Use One-Hot Encoding When

Categories are unordered.
Unique category count is low or moderate.
You want interpretability.
You are using linear or distance-based models.

Use Ordinal Encoding When

Categories have real order.
Business meaning supports ranking.
Examples include low, medium, high.
The order should be defined carefully.

Use Target Encoding When

There are many unique categories.
Category has strong relationship with target.
You use cross-validation or out-of-fold encoding.
You apply smoothing for rare categories.

Use Frequency Encoding When

Category frequency itself may be predictive.
You need compact representation.
High-cardinality variable is difficult to one-hot encode.
You want a simple alternative to target encoding.
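
The guidance above can be condensed into a rough rule of thumb. The function below is a hypothetical helper, and the `max_one_hot` threshold of 20 is an arbitrary assumption for illustration. It covers only category type and cardinality, so model type, leakage risk, and validation results still need to be checked separately.

```python
def suggest_encoding(is_ordered, n_unique, max_one_hot=20):
    """Suggest a starting encoding from category type and cardinality only."""
    if n_unique <= 2:
        return "binary 0/1"
    if is_ordered:
        return "ordinal"
    if n_unique <= max_one_hot:
        return "one-hot"
    # Many unordered categories: compact encodings are usually easier to handle.
    return "target or frequency"


print(suggest_encoding(is_ordered=False, n_unique=4))      # e.g. payment method
print(suggest_encoding(is_ordered=True, n_unique=3))       # e.g. low/medium/high
print(suggest_encoding(is_ordered=False, n_unique=50000))  # e.g. product ID
```

The output is a starting point rather than a final answer: the suggested method should still be compared against alternatives on validation data.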

Example: Encoding for Customer Churn Prediction

Business Problem

A telecom company wants to predict customer churn. The dataset contains categorical variables such as contract type, payment method, city, customer segment, and satisfaction level.

Categorical Variable	Variable Type	Recommended Encoding	Reason
Contract Type	Nominal, low-cardinality.	One-hot encoding.	Monthly, yearly, and two-year contracts are treated as distinct plans rather than as points on a numeric scale.
Payment Method	Nominal, low-cardinality.	One-hot encoding.	Payment method categories are unordered.
City	High-cardinality.	Target encoding or frequency encoding.	One-hot encoding may create too many columns.
Satisfaction Level	Ordinal.	Ordinal encoding.	Low, medium, and high satisfaction have meaningful order.
Customer Segment	Nominal or ordinal depending on definition.	One-hot or ordinal encoding.	Encoding depends on whether the segment has true ranking.

Example: Encoding for House Price Prediction

Regression Problem

A real estate company wants to predict house prices. The dataset contains location, property type, furnishing status, builder name, and property condition.

Location: High-cardinality; target encoding may be useful but must be done carefully.
Property Type: Apartment, villa, plot; one-hot encoding is usually suitable.
Furnishing Status: Unfurnished, semi-furnished, fully furnished; ordinal encoding may be suitable if business order is meaningful.
Builder Name: High-cardinality; frequency or target encoding may be considered.
Property Condition: Poor, average, good, excellent; ordinal encoding may be useful.

Handling Rare and Unknown Categories

Real-world data often contains rare categories and new categories that appear during prediction but were not present during training. These must be handled carefully.

Problem	Example	Recommended Treatment
Rare Categories	A city appears only 3 times in training data.	Group rare categories into “Other”.
Unknown Category in Test Data	A new payment method appears after deployment.	Use encoders that can handle unknown categories safely.
Too Many One-Hot Columns	Product ID has 50,000 unique values.	Use frequency encoding, target encoding, grouping, or advanced embeddings.
Inconsistent Category Labels	“Bangalore”, “Bengaluru”, and “BLR” used together.	Standardize categories before encoding.
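
The treatments in the table can be combined in a short sketch. Everything here is illustrative: the `ALIASES` table, the `min_count` threshold of 5, and the function names are assumptions, and a real alias table would come from reviewing the actual data.

```python
from collections import Counter

# Hypothetical alias table mapping lowercase variants to one canonical label.
ALIASES = {"bangalore": "Bangalore", "bengaluru": "Bangalore", "blr": "Bangalore"}


def standardize(label):
    """Trim whitespace and collapse known spelling variants."""
    cleaned = label.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def fit_known_categories(train_values, min_count=5):
    """Keep categories that appear at least `min_count` times in training data."""
    counts = Counter(standardize(value) for value in train_values)
    return {category for category, count in counts.items() if count >= min_count}


def group_rare_and_unknown(values, known, other="Other"):
    """Map rare and never-seen categories to a single fallback label."""
    return [
        value if value in known else other
        for value in (standardize(v) for v in values)
    ]


train_cities = (
    ["Bangalore"] * 3 + ["Bengaluru"] * 2 + ["BLR "] + ["Delhi"] * 6 + ["Shimla"] * 3
)
known = fit_known_categories(train_cities)
grouped = group_rare_and_unknown(["blr", "Shimla", "Pune", "Delhi"], known)
print(sorted(known))
print(grouped)
```

Standardizing first matters: without it, the three spellings of Bangalore would each fall below the threshold and be grouped into “Other”.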

Encoding and Train-Test Split

Encoding should be fitted using only the training data and then applied to validation and test data. This avoids data leakage and makes evaluation more realistic.

Safe Workflow: Split the dataset first, fit the encoder on training data only, then transform validation and test data using the fitted encoder.

Safe Encoding Pipeline

Split Data → Fit Encoder on Training Set → Transform Validation Set → Transform Test Set → Train and Evaluate Model
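
The split-then-fit order can be sketched with a small encoder that separates learning the categories from applying them. `SimpleOneHotEncoder` is a hypothetical class written for this example, the 80/20 split and the seed are arbitrary, and the model training step is omitted.

```python
import random


class SimpleOneHotEncoder:
    """Hypothetical minimal encoder: fit() learns categories, transform() reuses them."""

    def fit(self, values):
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # Categories not seen in fit() become all-zero rows.
        return [
            [1 if value == category else 0 for category in self.categories_]
            for value in values
        ]


rows = ["UPI", "Card", "Cash", "Wallet", "UPI", "Card", "UPI", "Cash", "Card", "UPI"]

# 1. Split first.
rng = random.Random(0)
rng.shuffle(rows)
split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]

# 2. Fit on the training rows only.
encoder = SimpleOneHotEncoder().fit(train)

# 3. Transform both sets with the same fitted encoder.
X_train = encoder.transform(train)
X_test = encoder.transform(test)
print(encoder.categories_)
```

Because the test rows never pass through `fit`, the encoded columns reflect only what was known at training time, which is also the situation the model will face in production.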

Common Mistakes in Categorical Encoding

Mistake	Why It Is Harmful	Better Approach
Using label encoding for unordered categories	Creates false numerical order.	Use one-hot encoding for nominal variables when cardinality is manageable.
One-hot encoding very high-cardinality variables	Creates too many sparse columns.	Use grouping, frequency encoding, or target encoding.
Target encoding before data split	Leaks target information into training features.	Use training-only or out-of-fold target encoding.
Ignoring unknown categories	Model may fail when new categories appear in production.	Use an “Unknown” or “Other” strategy.
Not standardizing category names	Same category may be treated as multiple categories.	Clean and standardize labels before encoding.

Best Practices for Encoding Categorical Variables

Categorical Encoding Checklist

Identify variable type: Check whether the category is nominal, ordinal, binary, or high-cardinality.
Clean categories first: Standardize spelling, spacing, case, and labels before encoding.
Use one-hot encoding for unordered low-cardinality variables: Avoid false ordering.
Use ordinal encoding only when order is meaningful: Define the order using business logic.
Be careful with label encoding: It can mislead models that treat numbers as ordered values.
Use target encoding safely: Fit only on training data and use out-of-fold encoding when possible.
Handle rare categories: Group rare values into “Other” if needed.
Prepare for unknown categories: New categories may appear in validation, test, or production data.
Validate model performance: Compare encoding methods using validation data.

Why Encoding is a Strategic Modelling Decision

Encoding is not just a technical preprocessing step. It directly affects how the model understands categories. The same categorical variable may need different encoding depending on the number of categories, model type, business meaning, and leakage risk.

Good encoding preserves useful information, avoids false assumptions, controls dimensionality, and improves model reliability.

Practical Insight: The best encoding method is the one that represents category information correctly without adding leakage, unnecessary complexity, or false order.

Key Takeaways

Categorical variables must usually be converted into numerical format before modelling.
One-hot encoding is suitable for nominal variables with manageable category counts.
Label encoding assigns integer codes but may create false order for unordered categories.
Ordinal encoding should be used only when categories have a meaningful rank.
Target encoding can be useful for high-cardinality variables but has high leakage risk.
Frequency encoding replaces categories with their occurrence counts or proportions.
Rare and unknown categories should be handled carefully.
Encoders should be fitted on training data only and then applied to validation and test data.
