Feature Selection¶
What This Is¶
Feature selection finds which features actually help the model and which ones are noise, redundant, or actively harmful. Unlike dimensionality reduction (which transforms features), feature selection keeps a subset of the original features unchanged and drops the rest.
When You Use It¶
- too many features slow down training or cause overfitting
- you suspect some features are leaking the target
- you want a simpler, more interpretable model
- you need to explain which inputs matter to a stakeholder
The Three Families¶
| Family | How It Works | Speed | Accounts for Model? |
|---|---|---|---|
| Filter | rank features by a statistical score, independent of the model | fastest | no |
| Wrapper | train models with different feature subsets, pick the best | slowest | yes |
| Embedded | the model learns feature importance during training | medium | yes |
Filter Methods — Start Here¶
Filter methods score each feature independently. They are fast and model-agnostic.
```python
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# ANOVA F-test (linear relationships)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)

# Mutual information (captures nonlinear relationships)
selector_mi = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected_mi = selector_mi.fit_transform(X_train, y_train)
```
Which score to use¶
- `f_classif` / `f_regression`: fast, assumes a linear relationship, good first check
- `mutual_info_classif` / `mutual_info_regression`: slower, captures nonlinear signal, needs more data
- `chi2`: for non-negative features (e.g., word counts)
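To see the difference in practice, here is a small synthetic sketch (the feature names and threshold are invented for illustration): a feature related to the target only through its square gets a low F-score but a clear mutual-information score.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
linear = rng.normal(size=n)     # related to the target linearly
nonlinear = rng.normal(size=n)  # related only through its square
noise = rng.normal(size=n)      # unrelated
y = ((linear + nonlinear**2) > 1).astype(int)
X = np.column_stack([linear, nonlinear, noise])

f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)
for name, f, mi in zip(["linear", "nonlinear", "noise"], f_scores, mi_scores):
    print(f"{name:>9}: F={f:8.1f}  MI={mi:.3f}")
```

The F-test misses the `nonlinear` feature because permuting its sign does not move the class means, while mutual information still detects it.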
Reading the scores¶
```python
scores = selector.scores_
feature_ranking = sorted(zip(feature_names, scores), key=lambda x: -x[1])
for name, score in feature_ranking[:10]:
    print(f"{name:>25}: {score:.2f}")
```
Correlation-Based Pruning¶
When two features are highly correlated, one is usually redundant:
```python
import numpy as np

corr_matrix = df[feature_cols].corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
```
Wrapper Methods — Best Subset¶
Wrapper methods train models on different subsets and pick the one that performs best.
```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    direction="forward",
    cv=5,
)
sfs.fit(X_train, y_train)
selected_mask = sfs.get_support()
```
- `direction="forward"`: start empty, add features one by one
- `direction="backward"`: start full, remove features one by one
Wrapper methods are slow but find feature combinations that work together.
Embedded Methods — Model-Based¶
Some models learn feature importance as part of training:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
importances = rf.feature_importances_

# Permutation importance (more reliable)
result = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0)
```
Why permutation importance is better¶
Built-in `feature_importances_` can be biased toward high-cardinality features. Permutation importance measures the actual impact on validation performance and is model-agnostic.
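A self-contained sketch of reading permutation importances, using a synthetic dataset from `make_classification` as a stand-in for your own `X_valid`/`y_valid`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False puts the 3 informative features in columns 0-2
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
result = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0)

# Rank by mean importance; the std shows run-to-run stability
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(f"feature {i}: {result.importances_mean[i]:+.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Features with importance near zero (or negative) did not help validation performance and are candidates for removal.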
The Selection Ladder¶
- Correlation pruning to remove obvious redundancy
- Filter methods (`SelectKBest`) for a fast first pass
- Permutation importance to validate which features actually help the model
- Sequential selection only when you need the best possible small subset
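The first three rungs of the ladder can be sketched end to end on synthetic data (the column names, thresholds, and `k` below are illustrative, not prescriptive):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=20, n_informative=5,
                           random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])
X_train, X_valid, y_train, y_valid = train_test_split(df, y, random_state=0)

# Rung 1: correlation pruning, on training data only
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
keep = [c for c in upper.columns if not any(upper[c] > 0.95)]

# Rung 2: fast filter pass with the ANOVA F-test
selector = SelectKBest(f_classif, k=min(10, len(keep))).fit(X_train[keep], y_train)
filtered = [c for c, m in zip(keep, selector.get_support()) if m]

# Rung 3: validate the survivors with permutation importance
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train[filtered], y_train)
result = permutation_importance(rf, X_valid[filtered], y_valid,
                                n_repeats=5, random_state=0)
final = [c for c, imp in zip(filtered, result.importances_mean) if imp > 0]
print("surviving features:", final)
```

Each rung is cheaper than the next one's worst case: pruning and filtering shrink the candidate set before any model-in-the-loop validation runs.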
Failure Pattern¶
Selecting features on the full dataset before splitting. If the selection step sees validation data, it can pick features that overfit to the specific split.
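A quick way to see the damage, and the fix: on pure noise, selecting features on the full dataset makes cross-validation look far better than chance, while putting selection inside a `Pipeline` (so it is refit on each training fold) reports the honest near-chance score. The sample sizes and `k` below are arbitrary:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))  # pure noise features
y = rng.integers(0, 2, size=100)  # random labels: no real signal exists

# Wrong: selection sees every row, then CV "validates" on data it peeked at
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Right: selection is refit inside each CV training fold
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("model", LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=5)

print(f"leaky CV accuracy:  {leaky.mean():.2f}")  # looks impressive
print(f"honest CV accuracy: {honest.mean():.2f}")  # near chance, correctly
```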
Another failure: trusting tree-based `feature_importances_` on one-hot-encoded features, where importance is split across the dummy columns.
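One workaround is to sum the dummy-column importances back over each original feature before comparing; the dataset and naming scheme below are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
color = rng.choice(["red", "green", "blue", "black"], size=500)
size = rng.normal(size=500)
y = (size + (color == "red")) > 0.5

df = pd.get_dummies(pd.DataFrame({"color": color, "size": size}),
                    columns=["color"])
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(df, y)

# Sum dummy-column importances back to their source feature
grouped = {}
for col, imp in zip(df.columns, rf.feature_importances_):
    base = col.split("_")[0]  # "color_red" -> "color"
    grouped[base] = grouped.get(base, 0.0) + imp
print(grouped)
```

Comparing `color_red` against `size` directly would understate the categorical feature; the grouped totals are the fairer comparison.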
Common Mistakes¶
- running `SelectKBest` on the entire dataset including test data
- using filter methods alone when feature interactions matter
- dropping a feature because its individual score is low, even though it helps in combination
- confusing correlation with causation when reading importance scores
Practice¶
- Apply `SelectKBest` with `f_classif` and `mutual_info_classif` and compare which features are selected.
- Remove highly correlated features and check whether model performance changes.
- Compare built-in feature importance against permutation importance for a random forest.
- Use forward sequential selection with 5-fold CV and report the best feature subset.
- Explain why feature selection must happen after the train/test split, not before.
Runnable Example¶
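A minimal end-to-end sketch, assuming scikit-learn's built-in breast-cancer dataset as a stand-in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# Selection lives inside the pipeline, so CV folds and the final fit
# see only training data
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
pipe.fit(X_train, y_train)

mask = pipe.named_steps["select"].get_support()
kept = [n for n, m in zip(data.feature_names, mask) if m]
print(f"CV accuracy:   {cv_scores.mean():.3f}")
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
print("kept features:", kept)
```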
Connections¶
Continue with Dimensionality Reduction for an alternative approach that transforms features instead of dropping them, and Hyperparameter Tuning for the full selection-and-tuning workflow.