Feature Selection¶
What This Is¶
Feature selection finds which features actually help the model and which ones are noise, redundant, or actively harmful. Unlike dimensionality reduction (which transforms features), feature selection keeps a subset of the original features unchanged and drops the rest.
When You Use It¶
- too many features slow down training or cause overfitting
- you suspect some features are leaking the target
- you want a simpler, more interpretable model
- you need to explain which inputs matter to a stakeholder
The Three Families¶
| Family | How It Works | Speed | Accounts for Model? |
|---|---|---|---|
| Filter | rank features by a statistical score, independent of the model | fastest | no |
| Wrapper | train models with different feature subsets, pick the best | slowest | yes |
| Embedded | the model learns feature importance during training | medium | yes |
Filter Methods — Start Here¶
Filter methods score each feature independently. They are fast and model-agnostic.
```python
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# ANOVA F-test (linear relationships)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)

# Mutual information (captures nonlinear relationships)
selector_mi = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected_mi = selector_mi.fit_transform(X_train, y_train)
```
Which score to use¶
- `f_classif` / `f_regression`: fast, assumes a linear relationship, good first check
- `mutual_info_classif` / `mutual_info_regression`: slower, captures nonlinear signal, needs more data
- `chi2`: for non-negative features (e.g., word counts)
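To see the difference in practice, here is a small synthetic sketch (the feature names and threshold are invented for illustration): a feature related to the target only through its square gets a low F-score but a clear mutual-information score.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
linear = rng.normal(size=n)     # related to the target linearly
nonlinear = rng.normal(size=n)  # related only through its square
noise = rng.normal(size=n)      # unrelated
y = ((linear + nonlinear**2) > 1).astype(int)
X = np.column_stack([linear, nonlinear, noise])

f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)
for name, f, mi in zip(["linear", "nonlinear", "noise"], f_scores, mi_scores):
    print(f"{name:>9}: F={f:8.1f}  MI={mi:.3f}")
```

The F-test misses the `nonlinear` feature because permuting its sign does not move the class means, while mutual information still detects it.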
Reading the scores¶
```python
scores = selector.scores_
feature_ranking = sorted(zip(feature_names, scores), key=lambda x: -x[1])
for name, score in feature_ranking[:10]:
    print(f"{name:>25}: {score:.2f}")
```
Correlation-Based Pruning¶
When two features are highly correlated, one is usually redundant:
```python
import numpy as np

corr_matrix = df[feature_cols].corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
```
Wrapper Methods — Best Subset¶
Wrapper methods train models on different subsets and pick the one that performs best.
```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    direction="forward",
    cv=5,
)
sfs.fit(X_train, y_train)
selected_mask = sfs.get_support()
```
- `direction="forward"`: start empty, add features one by one
- `direction="backward"`: start full, remove features one by one
Wrapper methods are slow but find feature combinations that work together.
Embedded Methods — Model-Based¶
Some models learn feature importance as part of training:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
importances = rf.feature_importances_

# Permutation importance (more reliable)
result = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0)
```
Why permutation importance is better¶
Built-in `feature_importances_` can be biased toward high-cardinality features. Permutation importance measures the actual impact on validation performance and is model-agnostic.
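A self-contained sketch of reading permutation importances, using a synthetic dataset from `make_classification` as a stand-in for your own `X_valid`/`y_valid`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False puts the 3 informative features in columns 0-2
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
result = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0)

# Rank by mean importance; the std shows run-to-run stability
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(f"feature {i}: {result.importances_mean[i]:+.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Features with importance near zero (or negative) did not help validation performance and are candidates for removal.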
The Selection Ladder¶
- Correlation pruning to remove obvious redundancy
- Filter methods (`SelectKBest`) for a fast first pass
- Permutation importance to validate which features actually help the model
- Sequential selection only when you need the best possible small subset
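The first three rungs of the ladder can be sketched end to end on synthetic data (the column names, thresholds, and `k` below are illustrative, not prescriptive):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=20, n_informative=5,
                           random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])
X_train, X_valid, y_train, y_valid = train_test_split(df, y, random_state=0)

# Rung 1: correlation pruning, on training data only
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
keep = [c for c in upper.columns if not any(upper[c] > 0.95)]

# Rung 2: fast filter pass with the ANOVA F-test
selector = SelectKBest(f_classif, k=min(10, len(keep))).fit(X_train[keep], y_train)
filtered = [c for c, m in zip(keep, selector.get_support()) if m]

# Rung 3: validate the survivors with permutation importance
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train[filtered], y_train)
result = permutation_importance(rf, X_valid[filtered], y_valid,
                                n_repeats=5, random_state=0)
final = [c for c, imp in zip(filtered, result.importances_mean) if imp > 0]
print("surviving features:", final)
```

Each rung is cheaper than the next one's worst case: pruning and filtering shrink the candidate set before any model-in-the-loop validation runs.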
Failure Pattern¶
Selecting features on the full dataset before splitting. If the selection step sees validation data, it can pick features that overfit to the specific split.
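A quick way to see the damage, and the fix: on pure noise, selecting features on the full dataset makes cross-validation look far better than chance, while putting selection inside a `Pipeline` (so it is refit on each training fold) reports the honest near-chance score. The sample sizes and `k` below are arbitrary:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))  # pure noise features
y = rng.integers(0, 2, size=100)  # random labels: no real signal exists

# Wrong: selection sees every row, then CV "validates" on data it peeked at
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Right: selection is refit inside each CV training fold
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("model", LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=5)

print(f"leaky CV accuracy:  {leaky.mean():.2f}")  # looks impressive
print(f"honest CV accuracy: {honest.mean():.2f}")  # near chance, correctly
```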
Another failure: trusting tree-based `feature_importances_` on one-hot-encoded features, where importance is split across the dummy columns.
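One workaround is to sum the dummy-column importances back over each original feature before comparing; the dataset and naming scheme below are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
color = rng.choice(["red", "green", "blue", "black"], size=500)
size = rng.normal(size=500)
y = (size + (color == "red")) > 0.5

df = pd.get_dummies(pd.DataFrame({"color": color, "size": size}),
                    columns=["color"])
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(df, y)

# Sum dummy-column importances back to their source feature
grouped = {}
for col, imp in zip(df.columns, rf.feature_importances_):
    base = col.split("_")[0]  # "color_red" -> "color"
    grouped[base] = grouped.get(base, 0.0) + imp
print(grouped)
```

Comparing `color_red` against `size` directly would understate the categorical feature; the grouped totals are the fairer comparison.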
Common Mistakes¶
- running `SelectKBest` on the entire dataset including test data
- using filter methods alone when feature interactions matter
- dropping a feature because its individual score is low, even though it helps in combination
- confusing correlation with causation when reading importance scores
Practice¶
- Apply `SelectKBest` with `f_classif` and `mutual_info_classif` and compare which features are selected.
- Remove highly correlated features and check whether model performance changes.
- Compare built-in feature importance against permutation importance for a random forest.
- Use forward sequential selection with 5-fold CV and report the best feature subset.
- Explain why feature selection must happen after the train/test split, not before.
Runnable Example¶
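A minimal end-to-end sketch, assuming scikit-learn's built-in breast-cancer dataset as a stand-in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# Selection lives inside the pipeline, so CV folds and the final fit
# see only training data
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
pipe.fit(X_train, y_train)

mask = pipe.named_steps["select"].get_support()
kept = [n for n, m in zip(data.feature_names, mask) if m]
print(f"CV accuracy:   {cv_scores.mean():.3f}")
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
print("kept features:", kept)
```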
Connections¶
Continue with Dimensionality Reduction for an alternative approach that transforms features instead of dropping them, and Hyperparameter Tuning for the full selection-and-tuning workflow.