Feature Selection Methods: Filter, Wrapper, and Embedded Techniques
Feature selection is the process of choosing the most useful input variables for a predictive model. Not every feature improves performance. Some features are irrelevant, redundant, noisy, highly correlated, or even harmful.
Good feature selection helps build models that are simpler, faster, more interpretable, and often more accurate on unseen data.
What is Feature Selection?
Feature selection means selecting a subset of relevant features from the available variables. The goal is to keep the features that provide useful predictive information and remove features that add little value or create unnecessary complexity.
For example, in a customer churn model, useful features may include customer tenure, complaint count, payment delays, and usage decline. Features such as customer ID or random internal codes may not help prediction and should usually be removed.
Core Idea: Feature selection is not about using fewer features blindly. It is about using the right features that improve model performance, stability, and interpretation.
Why Feature Selection Matters
Types of Feature Selection Methods
Three Major Feature Selection Families
| Method Family | How It Works | Examples | Main Advantage |
|---|---|---|---|
| Filter Methods | Rank or remove features using statistical measures before model training. | Correlation, chi-square, ANOVA, mutual information, variance threshold. | Fast and model-independent. |
| Wrapper Methods | Try different feature subsets and evaluate them using a model. | Forward selection, backward elimination, recursive feature elimination. | Considers model performance directly. |
| Embedded Methods | Feature selection happens during model training. | Lasso, Elastic Net, decision tree importance, random forest importance. | Balances performance and efficiency. |
1. Filter Methods
Filter methods select features using statistical tests, scores, or simple rules. They are applied before training the model and do not depend on a specific machine learning algorithm.
Filter methods are usually fast, simple, and useful as an initial feature screening step.
| Filter Method | Best For | What It Measures | Example Use |
|---|---|---|---|
| Correlation Analysis | Numerical feature vs numerical target. | Strength and direction of linear relationship. | Select features strongly related to house price. |
| Chi-Square Test | Categorical features and categorical target. | Association between categories and target classes. | Check whether payment method relates to churn. |
| ANOVA F-Test | Numerical features and categorical target. | Whether feature values differ across target classes. | Check whether income differs between default and non-default groups. |
| Mutual Information | Linear and non-linear relationships. | How much information a feature gives about the target. | Rank predictors in classification or regression problems. |
| Variance Threshold | Removing near-constant features. | Whether a feature has enough variation. | Remove a column where 99.9% of values are the same. |
Advantages of Filter Methods
- Fast and easy to apply.
- Useful for high-dimensional datasets.
- Independent of model choice.
- Good for early feature screening.
Limitations of Filter Methods
- May ignore feature interactions.
- May miss non-linear relationships if the wrong test is used.
- Do not directly optimize model performance.
- Statistical relevance may not equal business usefulness.
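A filter screening step might look like the following minimal sketch. It assumes scikit-learn and NumPy are available and uses synthetic data from `make_classification` in place of a real dataset; the choice of `k=3` and the appended constant column are illustrative assumptions, not recommendations.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
# Append a constant column so the variance filter has something to remove.
X = np.hstack([X, np.ones((X.shape[0], 1))])

# Split first, so the filters are fitted on training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: drop features with zero variance in the training data.
variance_filter = VarianceThreshold(threshold=0.0)
X_train_var = variance_filter.fit_transform(X_train)
X_test_var = variance_filter.transform(X_test)

# Step 2: keep the k features with the highest mutual information score.
# partial() fixes the random seed of the mutual information estimator.
mi_filter = SelectKBest(score_func=partial(mutual_info_classif, random_state=0), k=3)
X_train_sel = mi_filter.fit_transform(X_train_var, y_train)
X_test_sel = mi_filter.transform(X_test_var)

print("Features after variance filter:", X_train_var.shape[1])
print("Features after mutual information filter:", X_train_sel.shape[1])
```

Both filters are fitted on the training split and only applied to the test split, which keeps the screening step consistent with the leakage guidance later in this article.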
2. Wrapper Methods
Wrapper methods evaluate different combinations of features by training a model and measuring performance. They are more model-aware than filter methods because they select features based on how well the model performs.
Wrapper methods can be powerful, but they are often computationally expensive because the model must be trained many times.
| Wrapper Method | How It Works | Best Used When | Risk |
|---|---|---|---|
| Forward Selection | Starts with no features and adds the best feature one by one. | Feature count is moderate and interpretability matters. | May miss combinations that are weak individually but strong together. |
| Backward Elimination | Starts with all features and removes the least useful feature step by step. | You have a manageable number of features and want a smaller model. | Can be slow if there are many features. |
| Recursive Feature Elimination | Trains a model, ranks features, removes weakest features, and repeats. | You want model-based iterative feature ranking. | Computationally expensive on large datasets. |
| Exhaustive Search | Tests all possible feature subsets. | Very small feature sets only. | Usually impractical because combinations grow rapidly. |
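To make the cost of exhaustive search concrete: a dataset with p features has 2^p - 1 non-empty feature subsets, and each one would need its own model evaluation. The helper name below is hypothetical and the snippet only counts subsets; it does not train anything.

```python
def count_feature_subsets(n_features: int) -> int:
    """Number of non-empty subsets that exhaustive search would evaluate."""
    return 2 ** n_features - 1


for n in (5, 10, 20, 80):
    print(f"{n} features -> {count_feature_subsets(n):,} candidate subsets")
```

Every additional feature roughly doubles the number of candidate subsets, which is why exhaustive search is limited to very small feature sets.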
Forward Selection
Forward selection begins with an empty model. It adds one feature at a time based on which feature gives the best improvement in validation performance.
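The greedy loop can be sketched as follows. This is an illustrative implementation, not a reference one: the function name `forward_selection`, the use of mean cross-validated R² as the score, the linear model, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_selection(X, y, model, max_features, cv=5):
    """Greedy forward selection scored by mean cross-validated R^2."""
    selected = []                       # column indices chosen so far
    remaining = list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Score every remaining feature when added to the current subset.
        candidate_scores = {
            j: cross_val_score(model, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        best_j = max(candidate_scores, key=candidate_scores.get)
        if candidate_scores[best_j] <= best_score:
            break                       # no candidate improves the score
        best_score = candidate_scores[best_j]
        selected.append(best_j)
        remaining.remove(best_j)
    return selected, best_score


# Synthetic regression data (assumption for illustration).
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)
selected, score = forward_selection(X, y, LinearRegression(), max_features=3)
print("Selected column indices:", selected)
print("Cross-validated R^2:", round(score, 3))
```

In practice `X` and `y` would be the training split only, so that the held-out test set plays no part in choosing features.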
Backward Elimination
Backward elimination begins with all available features. It removes one feature at a time, usually the least useful feature, until removing more features harms model performance.
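One way to try backward elimination is scikit-learn's `SequentialFeatureSelector` with `direction="backward"`. The sketch below is a simplification: it stops at a fixed number of features (`n_features_to_select=3`, an arbitrary choice) rather than stopping when performance starts to drop, and it runs on synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic regression data (assumption for illustration).
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# Start from all 8 features and remove one at a time, using 5-fold
# cross-validation to decide which removal hurts the score least.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="backward",
    cv=5,
)
selector.fit(X, y)

kept = selector.get_support(indices=True)
print("Kept column indices:", kept.tolist())
```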
Recursive Feature Elimination
Recursive Feature Elimination, or RFE, repeatedly trains a model and removes the weakest feature or group of features. It continues until the desired number of features remains.
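A minimal RFE sketch with scikit-learn's `RFE` class follows. The logistic regression estimator, the target of four features, and the synthetic data are assumptions chosen for illustration; features are standardized first because RFE here ranks features by coefficient magnitude.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, n_redundant=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on training data only, then standardize the training features.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Remove one feature per iteration (step=1) until four remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4, step=1)
rfe.fit(X_train_scaled, y_train)

print("Selected mask:", rfe.support_)
print("Ranking (1 = selected):", rfe.ranking_)
```

The `ranking_` attribute assigns rank 1 to every selected feature; higher ranks indicate features that were eliminated earlier.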
Important: Wrapper methods should use validation data or cross-validation. If feature selection is based on test data, the test set becomes contaminated and final performance becomes unreliable.
3. Embedded Methods
Embedded methods perform feature selection as part of model training. Unlike filter methods, they are model-aware. Unlike wrapper methods, they usually do not require repeatedly testing many feature subsets manually.
| Embedded Method | Model Type | How It Selects Features | Example Use |
|---|---|---|---|
| Lasso Regression | Linear model with L1 regularization. | Pushes some feature coefficients exactly to zero. | Select useful variables in regression or classification. |
| Elastic Net | Combination of L1 and L2 regularization. | Balances feature selection and coefficient stability. | Useful when correlated features exist. |
| Decision Tree Importance | Tree-based model. | Features used more effectively in splits receive higher importance. | Rank features in classification or regression trees. |
| Random Forest Importance | Ensemble of decision trees. | Aggregates importance across many trees. | Identify strong predictors in tabular data. |
| Gradient Boosting Importance | Boosted tree model. | Ranks features based on contribution to reducing prediction error. | Feature ranking in high-performing business models. |
Advantages of Embedded Methods
- Feature selection happens during model training.
- Often faster than wrapper methods.
- Can capture model-specific feature usefulness.
- Useful for regularized and tree-based models.
Limitations of Embedded Methods
- Feature importance depends on the chosen model.
- Tree-based importance, especially impurity-based importance, can be biased toward high-cardinality or continuous variables.
- Correlated features may share importance unevenly.
- Important features should still be validated using performance and business logic.
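As an illustration of embedded selection, the sketch below fits a cross-validated Lasso and reads off which coefficients are non-zero. It assumes scikit-learn, uses synthetic data, and lets `LassoCV` choose the regularization strength; on a real dataset the number of surviving features depends heavily on that strength.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption for illustration).
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Standardize so the L1 penalty treats all features on the same scale,
# then let LassoCV pick the penalty strength by 5-fold cross-validation.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

coef = model.named_steps["lassocv"].coef_
selected = np.flatnonzero(coef)  # indices of features with non-zero coefficients

print("Non-zero coefficients:", len(selected), "of", coef.size)
print("Selected column indices:", selected.tolist())
```

Selection here is a by-product of fitting the model: no separate search over feature subsets is run.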
Filter vs Wrapper vs Embedded Methods
| Aspect | Filter Methods | Wrapper Methods | Embedded Methods |
|---|---|---|---|
| Model Dependency | Model-independent. | Model-dependent. | Model-dependent. |
| Speed | Fastest. | Slowest. | Moderate to fast. |
| Performance Awareness | Does not directly optimize model performance. | Directly evaluates model performance. | Uses model training process to assess importance. |
| Feature Interaction Handling | Usually weak. | Better. | Depends on model type. |
| Best Use | Initial screening and high-dimensional data. | Smaller feature sets and final refinement. | Practical model-aware selection. |
Feature Selection Workflow
Practical Feature Selection Pipeline
Example: Feature Selection for Customer Churn Prediction
Business Problem
A telecom company wants to predict customer churn. The dataset contains 80 features, including customer profile, plan details, usage behaviour, complaint history, payment records, and marketing interactions.
| Step | Feature Selection Action | Reason |
|---|---|---|
| 1 | Remove customer ID, phone number, and internal record IDs. | These are identifiers, not meaningful predictive signals. |
| 2 | Use variance threshold to remove almost constant features. | Features with almost no variation rarely help prediction. |
| 3 | Use chi-square test for categorical features against churn. | Checks whether categories are associated with churn outcome. |
| 4 | Use correlation analysis to remove highly redundant numerical features. | Reduces multicollinearity and duplicate information. |
| 5 | Use Lasso or tree-based importance to identify strong predictors. | Embedded methods rank features during model training. |
| 6 | Compare candidate feature sets on validation data, then confirm the final set once on the held-out test set. | Ensures selected features improve generalization. |
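Step 4 of the table, removing redundant numerical features, can be sketched with a simple pairwise correlation rule. The helper `drop_highly_correlated`, the 0.9 threshold, and the column names (`tenure`, `usage`, `tenure_days`) are hypothetical and stand in for real churn features.

```python
import numpy as np


def drop_highly_correlated(X, threshold=0.9):
    """Return indices of columns to keep.

    A column is dropped if its absolute Pearson correlation with any
    already-kept column exceeds the threshold.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep


# Hypothetical training features: tenure_days is a near-duplicate of tenure.
rng = np.random.default_rng(0)
tenure = rng.normal(size=200)
usage = rng.normal(size=200)
tenure_days = tenure * 30 + rng.normal(scale=0.01, size=200)
X_train = np.column_stack([tenure, usage, tenure_days])

keep = drop_highly_correlated(X_train, threshold=0.9)
print("Columns kept:", keep)
```

The rule keeps the first column of each correlated pair, so column order matters; a domain expert may prefer to decide which of two redundant features is easier to explain.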
Example: Feature Selection for House Price Prediction
Regression Problem
A real estate company wants to predict house prices. The dataset contains property area, number of rooms, location, furnishing status, builder name, parking availability, floor number, property age, distance from city centre, and many engineered features.
- Filter: Use correlation analysis to find numerical features strongly related to price.
- Filter: Remove features with very high missingness or near-zero variance.
- Wrapper: Use recursive feature elimination with a regression model to refine the feature set.
- Embedded: Use Lasso or random forest feature importance to identify useful variables.
- Business Check: Keep location-related features even if some statistical methods under-rank them, because location is business-critical in real estate.
Feature Selection and Data Leakage
Feature selection can also cause data leakage if it is done using the full dataset before splitting. For example, if you select features based on their relationship with the target using the full dataset, information from the test set influences the selected feature set.
High-Risk Mistake: Selecting features using the entire dataset before train-test split can leak information from validation or test data into training. This makes model performance look better than it really is.
| Feature Selection Step | Safe Practice |
|---|---|
| Correlation with target | Calculate using training data only, then apply selected features to validation and test sets. |
| Chi-square or ANOVA selection | Fit selection procedure only on training data. |
| Recursive feature elimination | Use cross-validation inside training data; keep final test set untouched. |
| Tree-based feature importance | Train feature importance model only on training data. |
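One common way to follow these practices is to put the selection step inside a scikit-learn `Pipeline`, so that it is refitted on the training portion of every cross-validation fold. The sketch below assumes synthetic data, an ANOVA F-test filter with `k=10`, and a logistic regression model; all three are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=5, random_state=0
)

# Split first; the test set is not used until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Within each fold, scaling and selection are fitted on that fold's
# training part only, so the validation part never influences selection.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Final fit on all training data, then a single evaluation on the test set.
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)

print("Mean cross-validated accuracy:", round(cv_scores.mean(), 3))
print("Test accuracy:", round(test_accuracy, 3))
```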
Features You Should Usually Remove Early
Identifier Columns
- Customer ID
- Transaction ID
- Phone number
- Email address
- Random internal record numbers
Leakage-Prone Features
- Cancellation reason when predicting churn
- Default settlement status when predicting loan default
- Future sales when forecasting demand
- Outcome-generated timestamps
Low-Quality Features
- Columns with almost one constant value
- Features with extremely high missingness
- Duplicate columns
- Features with unclear or unreliable definitions
Redundant Features
- Highly correlated duplicate variables
- Same data represented in multiple formats
- Derived variables that duplicate raw variables
- Features that add complexity without performance gain
Common Mistakes in Feature Selection
| Mistake | Why It Is Harmful | Better Approach |
|---|---|---|
| Selecting features before train-test split | Can leak test data information into training. | Split first, then perform feature selection on training data only. |
| Removing features only because correlation is low | Feature may have non-linear or interaction effects. | Use model-based importance and validation performance. |
| Keeping all features blindly | Can increase overfitting, noise, and training time. | Remove irrelevant, redundant, and leakage-prone features. |
| Relying on one method only | Different methods capture different types of usefulness. | Combine filter, embedded, validation, and business logic. |
| Ignoring business meaning | Statistical ranking may miss important domain variables. | Review selected features with domain understanding. |
Best Practices for Feature Selection
Feature Selection Checklist
- Start with business logic: Remove identifiers, leakage features, and meaningless columns early.
- Use filter methods for quick screening: Correlation, chi-square, ANOVA, mutual information, and variance threshold are useful first steps.
- Use wrapper methods when feature count is manageable: They directly evaluate model performance but can be slow.
- Use embedded methods for practical model-aware selection: Lasso and tree-based importance are common choices.
- Perform selection only on training data: Avoid leakage from validation or test data.
- Validate selected features: Compare model performance before and after selection.
- Check feature stability: Important features should remain useful across validation folds or time periods.
- Do not ignore interpretability: A slightly simpler model may be more valuable if it is easier to explain.
- Document the final feature set: Record why each feature was kept or removed.
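The stability check in the list above can be approximated by repeating the selection on each cross-validation fold and counting how often every feature is chosen. This is a rough sketch under assumptions: synthetic data, an ANOVA F-test filter with `k=5`, and five stratified folds.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold

# Synthetic classification data (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=15, n_informative=4, random_state=0
)

n_splits = 5
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
counts = Counter()

for train_idx, _ in cv.split(X, y):
    # Fit the selector on this fold's training rows only.
    selector = SelectKBest(score_func=f_classif, k=5)
    selector.fit(X[train_idx], y[train_idx])
    counts.update(selector.get_support(indices=True).tolist())

# Fraction of folds in which each feature was selected.
selection_rate = {feature: n / n_splits for feature, n in counts.items()}
for feature, rate in sorted(selection_rate.items(), key=lambda item: -item[1]):
    print(f"feature {feature}: selected in {rate:.0%} of folds")
```

Features selected in every fold are stronger candidates for the final set than features that appear only once or twice.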
Why Feature Selection is a Strategic Step
Feature selection directly affects model quality, speed, interpretability, and business trust. A model with too many weak features may overfit and become difficult to explain. A model with too few features may miss important signals.
The best feature selection process balances statistical evidence, model performance, and business understanding.
Practical Insight: Feature selection is not a one-time mechanical task. It is an iterative modelling decision that should be tested, validated, and documented.
Key Takeaways
- Feature selection chooses the most useful variables for predictive modelling.
- It improves model focus, speed, interpretability, and generalization.
- Filter methods use statistical measures before model training.
- Wrapper methods test feature subsets using model performance.
- Embedded methods select features during model training.
- Common methods include correlation, chi-square, ANOVA, mutual information, RFE, Lasso, and tree-based importance.
- Feature selection must be done on training data only to avoid leakage.
- The final feature set should be validated using model performance and business logic.