Encoding Categorical Variables: One-Hot, Label, and Target Encoding
Many real-world datasets contain categorical variables such as city, gender, payment method, product category, education level, customer segment, and contract type. Machine learning models usually require numerical input, so these categories must be converted into numbers before modelling.
This conversion process is called categorical encoding. Choosing the right encoding method is important because poor encoding can confuse the model, create false order, increase dimensionality, or even cause data leakage.
What are Categorical Variables?
Categorical variables represent groups, labels, or categories instead of continuous numerical values. They describe qualitative properties of an observation.
For example, payment method may contain categories such as UPI, credit card, debit card, wallet, and cash. These values cannot be directly understood by most machine learning algorithms unless they are encoded numerically.
Core Idea: Encoding converts category labels into numerical representations so that machine learning models can use them as input features.
Types of Categorical Variables
| Type | Meaning | Examples | Encoding Consideration |
|---|---|---|---|
| Nominal Categorical | Categories have no natural order. | City, gender, product category, payment method. | One-hot encoding is often suitable. |
| Ordinal Categorical | Categories have meaningful order. | Low/Medium/High, education level, satisfaction rating. | Ordinal encoding is suitable when the assigned integers follow the real order. |
| High-Cardinality Categorical | Variable has many unique categories. | PIN code, product ID, customer city, merchant ID. | Target, frequency, grouping, or embedding-style approaches may be needed. |
| Binary Categorical | Only two categories. | Yes/No, Active/Inactive, Male/Female. | Can be encoded as 0 and 1. |
Why Encoding Matters in Predictive Modelling
Common Encoding Techniques
Visual Overview of Encoding Methods
1. One-Hot Encoding
One-hot encoding creates a separate binary column for each category. Each new column contains 1 if the observation belongs to that category and 0 otherwise.
This method is commonly used for nominal categorical variables where categories do not have natural order.
| Original Payment Method | UPI | Card | Cash | Wallet |
|---|---|---|---|---|
| UPI | 1 | 0 | 0 | 0 |
| Card | 0 | 1 | 0 | 0 |
| Cash | 0 | 0 | 1 | 0 |
| Wallet | 0 | 0 | 0 | 1 |
Use one-hot encoding when:
- Categories are nominal with no natural order.
- The number of unique categories is small or moderate.
- You are using linear models, logistic regression, SVM, or neural networks.
- You want to avoid false ordering between categories.
Avoid one-hot encoding when:
- The variable has hundreds or thousands of categories.
- The dataset becomes too wide after encoding.
- Many categories are rare or appear only once.
- You need memory-efficient modelling.
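As a minimal sketch, the payment-method table above can be reproduced with pandas. The column name `payment_method` and the toy rows are assumptions made for illustration; `pd.get_dummies` is one common way to do this, not the only one.

```python
import pandas as pd

# Toy data mirroring the payment-method table (rows are illustrative).
df = pd.DataFrame({"payment_method": ["UPI", "Card", "Cash", "Wallet", "UPI"]})

# One binary column per category; dtype=int yields 0/1 instead of booleans.
one_hot = pd.get_dummies(df["payment_method"], dtype=int)

print(one_hot)
```

Each row has exactly one 1, and the number of new columns equals the number of distinct categories, which is why the approach becomes expensive for high-cardinality variables.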
2. Label Encoding
Label encoding assigns a unique integer to each category. For example, Red = 0, Blue = 1, Green = 2. This creates one numerical column instead of many binary columns.
Label encoding is simple, but it can be risky for nominal variables because the model may incorrectly assume an order or distance between categories.
| Original Category | Label Encoded Value | Interpretation Risk |
|---|---|---|
| Delhi | 0 | Model may wrongly treat Delhi as lower than Mumbai. |
| Mumbai | 1 | Number is only a code, not a real rank. |
| Kolkata | 2 | Model may wrongly assume Kolkata is greater than Delhi. |
Important: Label encoding should not be used blindly for unordered categories in models that interpret numerical order, such as linear regression or logistic regression.
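The city table above can be sketched with `pandas.factorize`, which assigns integer codes in order of first appearance. The sample values are illustrative; other tools may assign codes in a different order (for example alphabetically), which underlines that the numbers are arbitrary identifiers.

```python
import pandas as pd

cities = pd.Series(["Delhi", "Mumbai", "Kolkata", "Delhi"])

# factorize returns one integer code per row plus the list of unique labels.
codes, uniques = pd.factorize(cities)

print(list(codes))    # integer code for each row
print(list(uniques))  # label that each code stands for

# The codes are identifiers only: arithmetic on them (a mean, a difference)
# has no real-world meaning for an unordered variable such as city.
```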
3. Ordinal Encoding
Ordinal encoding is similar to label encoding, but the assigned numbers follow a real meaningful order. This is suitable when categories naturally represent levels or ranks.
| Original Category | Ordinal Encoded Value | Why It Makes Sense |
|---|---|---|
| Low | 1 | Lowest risk or intensity level. |
| Medium | 2 | Middle level. |
| High | 3 | Highest risk or intensity level. |
Ordinal encoding is useful for variables such as education level, satisfaction rating, risk category, product quality grade, and customer priority level.
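A minimal sketch of the Low/Medium/High table, assuming the order is written out by hand as a dictionary. The variable names are hypothetical; the key point is that the mapping is defined explicitly rather than inferred from the data.

```python
import pandas as pd

risk = pd.Series(["Low", "High", "Medium", "Low"])

# Explicit mapping chosen from domain knowledge, matching the table above.
risk_order = {"Low": 1, "Medium": 2, "High": 3}

# Labels missing from the mapping would become NaN, which makes gaps visible.
risk_encoded = risk.map(risk_order)

print(risk_encoded.tolist())
```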
4. Target Encoding
Target encoding replaces each category with a value based on the average target value for that category. It is especially useful for high-cardinality categorical variables where one-hot encoding would create too many columns.
For example, in a churn prediction problem, each city can be replaced by the churn rate of customers from that city.
| City | Number of Customers | Churn Rate | Target Encoded Value |
|---|---|---|---|
| Delhi | 1,000 | 18% | 0.18 |
| Mumbai | 850 | 31% | 0.31 |
| Kolkata | 300 | 47% | 0.47 |
High-Risk Warning: Target encoding can easily cause data leakage if target averages are calculated using validation or test data. It must be fitted only on training data, preferably with out-of-fold (cross-validated) estimates and smoothing.
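The sketch below shows the simplest leakage-aware variant: category means are computed on the training frame only and then looked up for test rows. The tiny `train` and `test` frames, the column names, and the choice to fall back to the overall training churn rate for unseen cities are all assumptions for illustration.

```python
import pandas as pd

train = pd.DataFrame({
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "Kolkata", "Delhi"],
    "churned": [0, 1, 1, 1, 0, 0],
})
test = pd.DataFrame({"city": ["Mumbai", "Pune"]})

# Statistics come from the training data only.
global_rate = train["churned"].mean()
city_rate = train.groupby("city")["churned"].mean()

# Look up the training-set rate; unseen cities fall back to the global rate.
train["city_te"] = train["city"].map(city_rate)
test["city_te"] = test["city"].map(city_rate).fillna(global_rate)

print(test)
```

Note that the encoded training column here still uses each row's own target value, so it can overfit on small categories; the out-of-fold approach described in the next section addresses that.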
Target Encoding and Leakage
Target encoding uses the target variable to create a feature. This makes it powerful but risky. If done incorrectly, the model may indirectly learn the answer from the target itself.
| Approach | Safe or Risky? | Reason |
|---|---|---|
| Calculate target encoding before train-test split | Risky | Test data information leaks into training features. |
| Fit target encoding only on training data | Safer | Validation and test data remain unseen during encoding calculation. |
| Use out-of-fold target encoding | Best Practice | Each training row receives an encoded value calculated without using its own target. |
| Use smoothing for rare categories | Recommended | Reduces overfitting for categories with very few observations. |
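The last two rows of the table can be combined in one sketch: out-of-fold encoding with a smoothed mean. The helper names, the smoothing weight `m`, and the round-robin fold assignment are assumptions chosen to keep the example short and deterministic; a shuffled K-fold split would normally be used instead.

```python
import numpy as np
import pandas as pd

def smoothed_means(frame, col, target, prior, m):
    """Per-category target mean, shrunk toward `prior` with weight `m`."""
    stats = frame.groupby(col)[target].agg(["sum", "count"])
    return (stats["sum"] + m * prior) / (stats["count"] + m)

def out_of_fold_target_encode(frame, col, target, n_splits=3, m=5.0):
    """Encode each row using statistics from the other folds only."""
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    fold_id = np.arange(len(frame)) % n_splits  # simple deterministic folds
    for k in range(n_splits):
        held_out = fold_id == k
        fit_part = frame[~held_out]
        prior = fit_part[target].mean()
        means = smoothed_means(fit_part, col, target, prior, m)
        # Categories unseen in the fitting folds fall back to the prior.
        values = frame.loc[held_out, col].map(means).fillna(prior)
        encoded.loc[held_out] = values.to_numpy()
    return encoded

df = pd.DataFrame({
    "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai", "Kolkata", "Delhi"],
    "churned": [0, 1, 1, 1, 1, 0],
})
df["city_oof"] = out_of_fold_target_encode(df, "city", "churned", n_splits=3, m=2.0)
print(df)
```

With smoothing, a category backed by few rows stays close to the overall rate, while a category with many rows is dominated by its own mean.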
5. Frequency Encoding
Frequency encoding replaces each category with the number of times it appears in the training data, or with its relative frequency (proportion). It is useful when category popularity itself carries predictive meaning.
For example, a commonly used product category may behave differently from a rare product category. Similarly, popular cities or high-volume merchants may have more stable behavioural patterns.
| Product Category | Frequency | Frequency Encoded Value |
|---|---|---|
| Electronics | 12,000 | 12000 |
| Books | 4,500 | 4500 |
| Luxury Watches | 250 | 250 |
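A small sketch of both variants (raw counts and proportions), assuming a toy training column named `category`. Treating unseen categories as a count of 0 is one possible choice made here for illustration, not a rule from the text.

```python
import pandas as pd

train = pd.DataFrame(
    {"category": ["Electronics"] * 4 + ["Books"] * 2 + ["Luxury Watches"]}
)

# Learn counts and proportions from the training data.
counts = train["category"].value_counts()
shares = train["category"].value_counts(normalize=True)

train["category_count"] = train["category"].map(counts)
train["category_share"] = train["category"].map(shares)

# Apply the learned counts to new data; unseen categories get 0 (an assumption).
new = pd.Series(["Books", "Toys"])
new_counts = new.map(counts).fillna(0).astype(int)

print(train)
print(new_counts.tolist())
```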
Comparing Encoding Methods
| Encoding Method | Best For | Advantages | Limitations |
|---|---|---|---|
| One-Hot Encoding | Nominal variables with few categories. | No false ordering; easy to interpret. | Creates many columns for high-cardinality variables. |
| Label Encoding | Tree-based models or category IDs in some cases. | Simple and compact. | Can create false order in unordered categories. |
| Ordinal Encoding | Ordered categories. | Preserves meaningful order. | Assumes distances between levels are meaningful. |
| Target Encoding | High-cardinality categorical variables. | Compact and can capture target relationship. | High leakage and overfitting risk if done incorrectly. |
| Frequency Encoding | Variables where category popularity matters. | Compact and simple. | Different categories with same frequency become indistinguishable. |
Choosing the Right Encoding Method
Encoding Decision Flow
Use one-hot encoding when:
- Categories are unordered.
- Unique category count is low or moderate.
- You want interpretability.
- You are using linear or distance-based models.
Use ordinal encoding when:
- Categories have real order.
- Business meaning supports ranking.
- Examples include low, medium, high.
- The order should be defined carefully.
Use target encoding when:
- There are many unique categories.
- Category has strong relationship with target.
- You use cross-validation or out-of-fold encoding.
- You apply smoothing for rare categories.
Use frequency encoding when:
- Category frequency itself may be predictive.
- You need compact representation.
- High-cardinality variable is difficult to one-hot encode.
- You want a simple alternative to target encoding.
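The decision flow above could be condensed into a small helper as a rough starting point. The function `suggest_encoding`, its parameters, and especially the cut-off `max_one_hot=15` are hypothetical; the text gives no numeric threshold, so the value should be tuned for each dataset and model.

```python
def suggest_encoding(n_unique, is_ordered, max_one_hot=15, target_available=True):
    """Return a rough encoding suggestion for one categorical column.

    max_one_hot is an illustrative cut-off, not a value taken from the text.
    """
    if is_ordered:
        return "ordinal"
    if n_unique <= max_one_hot:
        return "one-hot"
    if target_available:
        return "target (out-of-fold, smoothed)"
    return "frequency"

print(suggest_encoding(n_unique=4, is_ordered=False))
print(suggest_encoding(n_unique=3, is_ordered=True))
print(suggest_encoding(n_unique=5000, is_ordered=False))
```

A helper like this only proposes a candidate; the final choice should still be checked against validation performance.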
Example: Encoding for Customer Churn Prediction
Business Problem
A telecom company wants to predict customer churn. The dataset contains categorical variables such as contract type, payment method, city, customer segment, and satisfaction level.
| Categorical Variable | Variable Type | Recommended Encoding | Reason |
|---|---|---|---|
| Contract Type | Nominal, low-cardinality. | One-hot encoding. | Contract types are treated as distinct plans; one-hot encoding avoids assuming that churn changes steadily with contract length. |
| Payment Method | Nominal, low-cardinality. | One-hot encoding. | Payment method categories are unordered. |
| City | High-cardinality. | Target encoding or frequency encoding. | One-hot encoding may create too many columns. |
| Satisfaction Level | Ordinal. | Ordinal encoding. | Low, medium, and high satisfaction have meaningful order. |
| Customer Segment | Nominal or ordinal depending on definition. | One-hot or ordinal encoding. | Encoding depends on whether the segment has true ranking. |
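Part of the table above can be sketched with scikit-learn's `ColumnTransformer`, applying one-hot encoding to the nominal columns and ordinal encoding to satisfaction level. The column names and sample rows are assumptions; City and Customer Segment are left out because their treatment depends on cardinality and definition.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train = pd.DataFrame({
    "contract_type":  ["Monthly", "Yearly", "Two-year", "Monthly"],
    "payment_method": ["UPI", "Card", "Cash", "UPI"],
    "satisfaction":   ["Low", "High", "Medium", "High"],
})

encoder = ColumnTransformer(
    transformers=[
        # Nominal columns: one binary column per category.
        ("nominal", OneHotEncoder(handle_unknown="ignore"),
         ["contract_type", "payment_method"]),
        # Ordinal column: explicit order, encoded as 0, 1, 2.
        ("ordinal", OrdinalEncoder(categories=[["Low", "Medium", "High"]]),
         ["satisfaction"]),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)

X_train = encoder.fit_transform(train)
print(X_train.shape)
print(X_train[:, -1])  # the ordinal satisfaction column
```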
Example: Encoding for House Price Prediction
Regression Problem
A real estate company wants to predict house prices. The dataset contains location, property type, furnishing status, builder name, and property condition.
- Location: High-cardinality; target encoding may be useful but must be done carefully.
- Property Type: Apartment, villa, plot; one-hot encoding is usually suitable.
- Furnishing Status: Unfurnished, semi-furnished, fully furnished; ordinal encoding may be suitable if business order is meaningful.
- Builder Name: High-cardinality; frequency or target encoding may be considered.
- Property Condition: Poor, average, good, excellent; ordinal encoding may be useful.
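For the two ordered variables in this list, one possible sketch uses pandas ordered categoricals. The column names and the assumed furnishing order (unfurnished, then semi-furnished, then fully furnished) are illustrative and should be confirmed against business meaning.

```python
import pandas as pd

houses = pd.DataFrame({
    "condition":  ["Good", "Poor", "Excellent", "Average"],
    "furnishing": ["Semi-furnished", "Unfurnished", "Fully furnished", "Unfurnished"],
})

# Orders are written out explicitly; labels outside the list would get code -1.
condition_order = ["Poor", "Average", "Good", "Excellent"]
furnishing_order = ["Unfurnished", "Semi-furnished", "Fully furnished"]

houses["condition_code"] = pd.Categorical(
    houses["condition"], categories=condition_order, ordered=True
).codes
houses["furnishing_code"] = pd.Categorical(
    houses["furnishing"], categories=furnishing_order, ordered=True
).codes

print(houses)
```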
Handling Rare and Unknown Categories
Real-world data often contains rare categories and new categories that appear during prediction but were not present during training. These must be handled carefully.
| Problem | Example | Recommended Treatment |
|---|---|---|
| Rare Categories | A city appears only 3 times in training data. | Group rare categories into “Other”. |
| Unknown Category in Test Data | A new payment method appears after deployment. | Use encoders that can handle unknown categories safely. |
| Too Many One-Hot Columns | Product ID has 50,000 unique values. | Use frequency encoding, target encoding, grouping, or advanced embeddings. |
| Inconsistent Category Labels | “Bangalore”, “Bengaluru”, and “BLR” used together. | Standardize categories before encoding. |
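A sketch of three of the treatments above: standardizing labels, grouping rare categories, and mapping unseen values to the same bucket. The alias table, the `min_count` threshold of 2, and the helper names are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical alias table; real projects would maintain their own.
ALIASES = {"bangalore": "bengaluru", "blr": "bengaluru"}

def standardize(values: pd.Series) -> pd.Series:
    """Trim whitespace, lower-case, and merge known spelling variants."""
    return values.str.strip().str.lower().replace(ALIASES)

train_city = standardize(pd.Series(
    ["Bangalore", "Bengaluru ", "BLR", "Delhi", "delhi", "Mumbai", "Pune"]
))

# Keep categories seen at least min_count times in training (illustrative threshold).
min_count = 2
counts = train_city.value_counts()
frequent = counts[counts >= min_count].index

def group_rare(values: pd.Series) -> pd.Series:
    """Map rare or never-seen categories to a shared 'other' bucket."""
    return values.where(values.isin(frequent), "other")

train_grouped = group_rare(train_city)
new_grouped = group_rare(standardize(pd.Series(["BLR", "Chennai"])))

print(train_grouped.tolist())
print(new_grouped.tolist())
```

Because the list of frequent categories is learned from training data, a category that first appears after deployment is handled the same way as a rare one.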
Encoding and Train-Test Split
Encoding should be fitted using only the training data and then applied to validation and test data. This avoids data leakage and makes evaluation more realistic.
Safe Workflow: Split the dataset first, fit the encoder on training data only, then transform validation and test data using the fitted encoder.
Safe Encoding Pipeline
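The safe workflow (split first, fit on training data, then transform the held-out data) might look like the following with scikit-learn. The toy frame, the column names, and the made-up category "Crypto" are assumptions; `handle_unknown="ignore"` makes the encoder emit an all-zero row for categories it did not see during fitting.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "payment_method": ["UPI", "Card", "Cash", "UPI", "Card", "Wallet", "UPI", "Cash"],
    "churned":        [0, 1, 0, 0, 1, 1, 0, 1],
})

# 1. Split before any encoding statistics are computed.
X_train, X_test, y_train, y_test = train_test_split(
    df[["payment_method"]], df["churned"], test_size=0.25, random_state=0
)

# 2. Fit the encoder on the training rows only.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_enc = encoder.fit_transform(X_train).toarray()

# 3. Reuse the fitted encoder for test data and for new, unseen categories.
X_test_enc = encoder.transform(X_test).toarray()
unseen_enc = encoder.transform(pd.DataFrame({"payment_method": ["Crypto"]})).toarray()

print(X_train_enc.shape, X_test_enc.shape)
print(unseen_enc)
```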
Common Mistakes in Categorical Encoding
| Mistake | Why It Is Harmful | Better Approach |
|---|---|---|
| Using label encoding for unordered categories | Creates false numerical order. | Use one-hot encoding for nominal variables when cardinality is manageable. |
| One-hot encoding very high-cardinality variables | Creates too many sparse columns. | Use grouping, frequency encoding, or target encoding. |
| Target encoding before data split | Leaks target information into training features. | Use training-only or out-of-fold target encoding. |
| Ignoring unknown categories | Model may fail when new categories appear in production. | Use an “Unknown” or “Other” strategy. |
| Not standardizing category names | Same category may be treated as multiple categories. | Clean and standardize labels before encoding. |
Best Practices for Encoding Categorical Variables
Categorical Encoding Checklist
- Identify variable type: Check whether the category is nominal, ordinal, binary, or high-cardinality.
- Clean categories first: Standardize spelling, spacing, case, and labels before encoding.
- Use one-hot encoding for unordered low-cardinality variables: Avoid false ordering.
- Use ordinal encoding only when order is meaningful: Define the order using business logic.
- Be careful with label encoding: It can mislead models that treat numbers as ordered values.
- Use target encoding safely: Fit only on training data and use out-of-fold encoding when possible.
- Handle rare categories: Group rare values into “Other” if needed.
- Prepare for unknown categories: New categories may appear in validation, test, or production data.
- Validate model performance: Compare encoding methods using validation data.
Why Encoding is a Strategic Modelling Decision
Encoding is not just a technical preprocessing step. It directly affects how the model understands categories. The same categorical variable may need different encoding depending on the number of categories, model type, business meaning, and leakage risk.
Good encoding preserves useful information, avoids false assumptions, controls dimensionality, and improves model reliability.
Practical Insight: The best encoding method is the one that represents category information correctly without adding leakage, unnecessary complexity, or false order.
Key Takeaways
- Categorical variables must usually be converted into numerical format before modelling.
- One-hot encoding is suitable for nominal variables with manageable category counts.
- Label encoding assigns integer codes but may create false order for unordered categories.
- Ordinal encoding should be used only when categories have a meaningful rank.
- Target encoding can be useful for high-cardinality variables but has high leakage risk.
- Frequency encoding replaces categories with their occurrence counts or proportions.
- Rare and unknown categories should be handled carefully.
- Encoders should be fitted on training data only and then applied to validation and test data.