Ensemble Methods

What This Is

Ensemble methods combine multiple models to produce a stronger prediction than any single model alone. The three core strategies are bagging, boosting, and stacking. Each addresses a different weakness of a single learner:

  • bagging trains many models on bootstrap samples and averages — reduces variance
  • boosting trains models sequentially, each correcting the previous errors — reduces bias
  • stacking trains diverse base models and learns a meta-model on their outputs — can reduce both, at the cost of complexity

The decision is never "use an ensemble"; it is "which weakness does my single model have, and which ensemble strategy targets that weakness?"

When You Use It

  • a single model is unstable or has high variance (bagging / random forest)
  • a single model is too weak and underfits (boosting)
  • you have several reliably diverse models and want to combine them (stacking, voting)
  • you need a strong tabular baseline without heavy feature engineering
  • you need predictions that are robust to label noise (bagging helps; boosting does not)

Do Not Use It When

  • the data is tiny and the features are weak — an ensemble of weak signals is still a weak signal
  • you need to explain individual predictions to non-technical stakeholders — random forests and boosted trees are harder to defend than a single linear model
  • latency is tight — running N models is N times the inference cost unless you compress
  • the gain over a calibrated baseline is within noise — do not pay the complexity tax for a win the confidence interval eats

Strategy Comparison

| Strategy | How It Works | Reduces | Risk |
| --- | --- | --- | --- |
| bagging | train many models on bootstrap samples, average predictions | variance | limited improvement on bias |
| boosting | train models sequentially, each correcting the previous errors | bias | overfits without regularization |
| stacking | train diverse base models, then train a meta-model on their out-of-fold predictions | both | complexity, leakage risk |
| voting (hard) | majority vote across models | noise | ignores confidence |
| voting (soft) | average probabilities across models | variance + noise | requires calibrated outputs |

Tooling

| Estimator | Type | When to use |
| --- | --- | --- |
| RandomForestClassifier | bagging | strong default for tabular data |
| GradientBoostingClassifier | boosting | sequential error correction on small data |
| HistGradientBoostingClassifier | boosting | faster on large datasets, handles missing values |
| LGBMClassifier / XGBClassifier / CatBoostClassifier | boosting | current tabular state of the art |
| AdaBoostClassifier | boosting | simple baseline; mostly historical |
| BaggingClassifier | bagging | wrapping any base estimator with bootstrap |
| VotingClassifier | voting | simple combination of diverse models |
| StackingClassifier | stacking | meta-learner on top of base models |

Minimal Examples

Random forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, max_depth=None,
                            min_samples_leaf=2, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print(f"Validation accuracy: {rf.score(X_valid, y_valid):.3f}")

Gradient boosting

from sklearn.ensemble import HistGradientBoostingClassifier

gb = HistGradientBoostingClassifier(
    max_iter=500, max_depth=6, learning_rate=0.05,
    early_stopping=True, validation_fraction=0.1, random_state=0,
)
gb.fit(X_train, y_train)
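
Soft voting

The decision table later in this section recommends a soft-voting ensemble as the starting point when you have several diverse signals, but no snippet for it appears in the originals above; this is a minimal sketch, assuming the same X_train / y_train / X_valid / y_valid split as the other examples:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

vote = VotingClassifier(
    estimators=[
        ("lr",  LogisticRegression(max_iter=1000)),
        ("svc", SVC(kernel="rbf", probability=True, random_state=0)),
        ("rf",  RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)),
    ],
    voting="soft",   # average predicted probabilities; every member must expose predict_proba
)
vote.fit(X_train, y_train)
print(f"Validation accuracy: {vote.score(X_valid, y_valid):.3f}")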

Stacking with honest CV

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

estimators = [
    ("dt",  DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("svc", SVC(kernel="rbf", probability=True, random_state=0)),
    ("rf",  RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)),
]
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=StratifiedKFold(5, shuffle=True, random_state=0),   # honest out-of-fold preds
    stack_method="predict_proba",
    passthrough=False,
)
stack.fit(X_train, y_train)

The cv= argument controls the single most important step in the stacking snippet: how the out-of-fold predictions the meta-learner trains on are generated. scikit-learn's StackingClassifier defaults to 5-fold CV here even if you omit cv=, so the real trap is the hand-rolled version: if the meta-learner sees predictions from base models that were trained on the same rows the meta-learner is trained on, that is a textbook case of leakage masquerading as improvement.

Stacking With Honest Cross-Validation

The leakage trap in stacking is easy to miss:

  • wrong: fit base models on all of X_train, predict on X_train, train the meta on those predictions. The base models have seen every row they are now being scored on. The meta learns the base models' training-set overconfidence, not their held-out behavior.
  • right: generate out-of-fold predictions for each base model via K-fold CV, train the meta on those, then refit each base model on all of X_train for inference.

StackingClassifier does this automatically (cv= defaults to 5-fold; passing an explicit CV object simply makes the splitting strategy visible); a hand-rolled stack has to do it explicitly, as in the sketch below. If meta-features are computed the wrong way, the stacking score on the training fold can be meaningless and the gap to the validation fold enormous.
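
What the explicit version looks like: a minimal sketch, assuming the estimators list from the snippet above, a binary target, and the same X_train / X_valid split as the other examples.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

cv = StratifiedKFold(5, shuffle=True, random_state=0)

# 1) out-of-fold meta-features: every row is predicted by a model that never saw it
meta_train = np.column_stack([
    cross_val_predict(est, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
    for _, est in estimators
])

# 2) the meta-learner trains only on the held-out behavior of the base models
meta = LogisticRegression(max_iter=1000).fit(meta_train, y_train)

# 3) refit each base model on all of X_train before inference
fitted = [est.fit(X_train, y_train) for _, est in estimators]
meta_valid = np.column_stack([m.predict_proba(X_valid)[:, 1] for m in fitted])
print(f"Stacked validation accuracy: {meta.score(meta_valid, y_valid):.3f}")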

For a blending variant (less principled but faster): train base models on one half of the training set, generate predictions on the other half, train the meta on those. Blending is simpler to code but wastes data and forces the meta to learn from fewer rows.
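
A sketch of that split, continuing with the same illustrative names as the hand-rolled stack above:

from sklearn.model_selection import train_test_split

# half for the base models, half for the meta; the meta never sees base-model training rows
X_base, X_blend, y_base, y_blend = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=0)

blend_features = np.column_stack([
    est.fit(X_base, y_base).predict_proba(X_blend)[:, 1] for _, est in estimators
])
blender = LogisticRegression(max_iter=1000).fit(blend_features, y_blend)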

Diversity Is The Point — Inspect It

An ensemble of three models that always agree is just one model with triple the inference cost. The benefit of ensembling scales with the diversity of the base learners, not their individual strength.

Inspect diversity with pairwise correlation of the out-of-fold predictions:

import numpy as np
import pandas as pd
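# each column is one model's out-of-fold positive-class probabilities
# (e.g. cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1],
#  as in the hand-rolled stacking sketch above)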

oof = pd.DataFrame({
    "rf":  rf_oof_scores,
    "gb":  gb_oof_scores,
    "svc": svc_oof_scores,
    "lr":  lr_oof_scores,
})
print(oof.corr(method="pearson").round(3))

Read-out:

  • correlations > 0.95 across all pairs → diversity is missing; stacking will not help much
  • one correlation < 0.8 → that model carries a different error pattern; the meta can use it
  • correlations negative across a subset → almost certainly a bug in the pipelines, not real diversity

A common move in competitions: keep the model with the best standalone score, then drop otherwise-strong models that correlate above 0.97 with it, in favor of weaker but less correlated signals. The meta wins more from diversity than from average strength.
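
A sketch of that pruning step on the oof frame above, with "gb" standing in as the hypothetical best standalone model and the 0.97 threshold taken from the rule of thumb:

best = "gb"                                    # hypothetical best standalone model
corr_with_best = oof.corr().loc[best].drop(best)
keep = [best] + corr_with_best[corr_with_best < 0.97].index.tolist()
print("worth keeping for the meta:", keep)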

Feature Importance — Tool, Not Ground Truth

Ensemble methods provide built-in feature importance:

importances = rf.feature_importances_
sorted_idx = importances.argsort()[::-1]
for i in sorted_idx[:10]:
    print(f"  {feature_names[i]:>25}: {importances[i]:.4f}")

Treat the output as a debugging tool, not as causal evidence. Tree-based importance is biased toward high-cardinality features and can swap heavily across reruns. The honest companion is permutation importance on a held-out set: shuffle a feature, rerun inference, measure the score drop.

from sklearn.inspection import permutation_importance
r = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0, n_jobs=-1)
order = r.importances_mean.argsort()[::-1]
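# report the mean validation-score drop per shuffled feature, largest first
# (feature_names as assumed in the tree-importance snippet above)
for i in order[:10]:
    print(f"  {feature_names[i]:>25}: {r.importances_mean[i]:.4f} +/- {r.importances_std[i]:.4f}")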

If tree importance and permutation importance disagree sharply for a feature, trust permutation.

What To Inspect

  • individual-learner scores — they should be comparable, not identical; one much weaker learner can drag the ensemble down in voting and help in stacking
  • out-of-fold correlations — the diversity test above
  • learning curves per base learner — are the weaker ones plateauing for bias or variance reasons? That shapes the next move
  • number of estimators — both bagging and boosting have a point where more hurts; for bagging it plateaus, for boosting it overfits
  • calibration of the ensemble output — bagged trees tend to be overconfident; boosted trees can be systematically miscalibrated at the tails. See Calibration and Thresholds
  • training-vs-validation gap — boosting with a big learning_rate and many rounds reliably overfits; the staged-loss sketch after this list shows the gap opening round by round
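
A minimal way to watch that gap per boosting round, assuming the same split as the earlier snippets and using GradientBoostingClassifier, which exposes per-round staged_predict_proba (the hyperparameters here are illustrative):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.3, random_state=0)
gbm.fit(X_train, y_train)

# per-round log loss on train vs. validation; the round where the gap starts widening
# is the overfit point that early stopping would have caught
for n, (p_tr, p_va) in enumerate(zip(gbm.staged_predict_proba(X_train),
                                     gbm.staged_predict_proba(X_valid)), start=1):
    if n % 50 == 0:
        print(f"round {n:4d}  train {log_loss(y_train, p_tr):.3f}  "
              f"valid {log_loss(y_valid, p_va):.3f}")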

Failure Pattern

The canonical bagging failure is using a random forest with unlimited depth on a small dataset, trusting the training score, and being confused when validation collapses. The fix is min_samples_leaf > 1 and honest cross-validation.

The canonical boosting failure is too many rounds without early stopping. The training loss keeps dropping; the validation loss starts rising unnoticed. The fix is always: early stopping on a held-out fold.

The canonical stacking failure is leakage in meta-feature construction — the meta-learner sees in-sample base predictions. The fix is cv=... in StackingClassifier, or manual K-fold OOF in hand-rolled stacks.

Common Mistakes

  • setting n_estimators too low for boosting and giving up before the curve has stabilized
  • not tuning max_depth, learning_rate, or min_samples_leaf for boosting
  • treating feature importance as definitive without the permutation check
  • assuming ensembles always beat simple models — on tiny data, they do not
  • stacking without honest out-of-fold predictions (leakage masquerading as gain)
  • combining models whose OOF correlation is > 0.97 — pure cost, no benefit
  • comparing ensemble validation scores across runs with different CV seeds — a seed effect can look like an ensemble effect

Decision: Which Ensemble Strategy

| Situation | Start with | Escalate to |
| --- | --- | --- |
| tabular, low bias, high variance | random forest | boosted trees if diversity helps |
| tabular, high bias, plenty of data | LightGBM / XGBoost / CatBoost | tuned version with early stopping |
| tabular, several diverse signals (tree, linear, kNN) | soft-voting ensemble | stacking with LR meta |
| tiny or noisy data | single well-regularized model | bagging of that model |
| competition, long time budget | stacking with diverse base learners + honest OOF | bigger and more diverse base set |

A practical rule: random forest is the "know-nothing" tabular baseline; boosted trees are the default strong tabular model; stacking is the endgame when the easy gains are gone.

Practice

  1. Compare a single decision tree against a random forest on the same split. Report both the mean CV score and the band. Which model's gain is larger than the band?
  2. Train three base models (tree, SVC, logistic regression) on a dataset. Compute the out-of-fold prediction correlation matrix. Identify which pair carries the most diversity.
  3. Build a stacking classifier with cv=5 and compare it to the best single model. Then hand-roll a stack whose meta-learner trains on in-sample base predictions (no out-of-fold step) and show the leakage by comparing training vs. validation scores.
  4. Train a boosted model with learning_rate=0.3 and 1000 rounds, no early stopping. Plot training vs. validation loss. Identify the overfit point.
  5. Compute feature-importance and permutation-importance rankings. Name the feature where they disagree most and explain why.
  6. On a small dataset, compare a well-regularized logistic regression against an ensemble. Decide which wins and whether the win is worth the complexity.

Runnable Example

Longer Connection

Ensemble methods sit next to:

The ensemble is not the point. The decision — bagging for variance, boosting for bias, stacking for diverse signal — is the point. If the student cannot state which weakness the ensemble is targeting, they do not yet need the ensemble.