Ensemble Methods

What This Is

Ensemble methods combine multiple models to produce a stronger prediction than any single model alone. The three core strategies are bagging, boosting, and stacking. Each addresses a different weakness of a single learner:

  • bagging trains many models on bootstrap samples and averages — reduces variance
  • boosting trains models sequentially, each correcting the previous errors — reduces bias
  • stacking trains diverse base models and learns a meta-model on their outputs — can reduce both, at the cost of complexity

The decision is never "use an ensemble"; it is "which weakness does my single model have, and which ensemble strategy targets that weakness?"

When You Use It

  • a single model is unstable or has high variance (bagging / random forest)
  • a single model is too weak and underfits (boosting)
  • you have several reliably diverse models and want to combine them (stacking, voting)
  • you need a strong tabular baseline without heavy feature engineering
  • you need predictions that are robust to label noise (bagging helps; boosting does not)

Do Not Use It When

  • the data is tiny and the features are weak — an ensemble of weak signals is still a weak signal
  • you need to explain individual predictions to non-technical stakeholders — random forests and boosted trees are harder to defend than a single linear model
  • latency is tight — running N models is N times the inference cost unless you compress
  • the gain over a calibrated baseline is within noise — do not pay the complexity tax for a win the confidence interval eats

Strategy Comparison

| Strategy | How It Works | Reduces | Risk |
| --- | --- | --- | --- |
| bagging | train many models on bootstrap samples, average predictions | variance | limited improvement on bias |
| boosting | train models sequentially, each correcting the previous errors | bias | overfits without regularization |
| stacking | train diverse base models, then train a meta-model on their out-of-fold predictions | both | complexity, leakage risk |
| voting (hard) | majority vote across models | noise | ignores confidence |
| voting (soft) | average probabilities across models | variance + noise | requires calibrated outputs |

Tooling

| Estimator | Type | When to use |
| --- | --- | --- |
| RandomForestClassifier | bagging | strong default for tabular data |
| GradientBoostingClassifier | boosting | sequential error correction on small data |
| HistGradientBoostingClassifier | boosting | faster on large datasets, handles missing values |
| LGBMClassifier / XGBClassifier / CatBoostClassifier | boosting | current tabular state of the art |
| AdaBoostClassifier | boosting | simple baseline; mostly historical |
| BaggingClassifier | bagging | wrapping any base estimator with bootstrap |
| VotingClassifier | voting | simple combination of diverse models |
| StackingClassifier | stacking | meta-learner on top of base models |

Minimal Examples

Random forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, max_depth=None,
                            min_samples_leaf=2, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print(f"Validation accuracy: {rf.score(X_valid, y_valid):.3f}")

Gradient boosting

from sklearn.ensemble import HistGradientBoostingClassifier

gb = HistGradientBoostingClassifier(
    max_iter=500, max_depth=6, learning_rate=0.05,
    early_stopping=True, validation_fraction=0.1, random_state=0,
)
gb.fit(X_train, y_train)
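
Soft voting

The decision table later in this section recommends a soft-voting ensemble as the starting point when you have several diverse signals, but no snippet for it appears in the originals above; this is a minimal sketch, assuming the same X_train / y_train / X_valid / y_valid split as the other examples:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

vote = VotingClassifier(
    estimators=[
        ("lr",  LogisticRegression(max_iter=1000)),
        ("svc", SVC(kernel="rbf", probability=True, random_state=0)),
        ("rf",  RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)),
    ],
    voting="soft",   # average predicted probabilities; every member must expose predict_proba
)
vote.fit(X_train, y_train)
print(f"Validation accuracy: {vote.score(X_valid, y_valid):.3f}")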

Stacking with honest CV

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

estimators = [
    ("dt",  DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("svc", SVC(kernel="rbf", probability=True, random_state=0)),
    ("rf",  RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)),
]
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=StratifiedKFold(5, shuffle=True, random_state=0),   # honest out-of-fold preds
    stack_method="predict_proba",
    passthrough=False,
)
stack.fit(X_train, y_train)

The cv= argument controls the single most important step in the stacking snippet: how the out-of-fold predictions the meta-learner trains on are generated. scikit-learn's StackingClassifier defaults to 5-fold CV here even if you omit cv=, so the real trap is the hand-rolled version: if the meta-learner sees predictions from base models that were trained on the same rows the meta-learner is trained on, that is a textbook case of leakage masquerading as improvement.

Stacking With Honest Cross-Validation

The leakage trap in stacking is easy to miss:

  • wrong: fit base models on all of X_train, predict on X_train, train the meta on those predictions. The base models have seen every row they are now being scored on. The meta learns the base models' training-set overconfidence, not their held-out behavior.
  • right: generate out-of-fold predictions for each base model via K-fold CV, train the meta on those, then refit each base model on all of X_train for inference.

StackingClassifier does this automatically (cv= defaults to 5-fold; passing an explicit CV object simply makes the splitting strategy visible); a hand-rolled stack has to do it explicitly, as in the sketch below. If meta-features are computed the wrong way, the stacking score on the training fold can be meaningless and the gap to the validation fold enormous.
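
What the explicit version looks like: a minimal sketch, assuming the estimators list from the snippet above, a binary target, and the same X_train / X_valid split as the other examples.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

cv = StratifiedKFold(5, shuffle=True, random_state=0)

# 1) out-of-fold meta-features: every row is predicted by a model that never saw it
meta_train = np.column_stack([
    cross_val_predict(est, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
    for _, est in estimators
])

# 2) the meta-learner trains only on the held-out behavior of the base models
meta = LogisticRegression(max_iter=1000).fit(meta_train, y_train)

# 3) refit each base model on all of X_train before inference
fitted = [est.fit(X_train, y_train) for _, est in estimators]
meta_valid = np.column_stack([m.predict_proba(X_valid)[:, 1] for m in fitted])
print(f"Stacked validation accuracy: {meta.score(meta_valid, y_valid):.3f}")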

For a blending variant (less principled but faster): train base models on one half of the training set, generate predictions on the other half, train the meta on those. Blending is simpler to code but wastes data and forces the meta to learn from fewer rows.
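
A sketch of that split, continuing with the same illustrative names as the hand-rolled stack above:

from sklearn.model_selection import train_test_split

# half for the base models, half for the meta; the meta never sees base-model training rows
X_base, X_blend, y_base, y_blend = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=0)

blend_features = np.column_stack([
    est.fit(X_base, y_base).predict_proba(X_blend)[:, 1] for _, est in estimators
])
blender = LogisticRegression(max_iter=1000).fit(blend_features, y_blend)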

Diversity Is The Point — Inspect It

An ensemble of three models that always agree is just one model with triple the inference cost. The benefit of ensembling scales with the diversity of the base learners, not their individual strength.

Inspect diversity with pairwise correlation of the out-of-fold predictions:

import numpy as np
import pandas as pd
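# each column is one model's out-of-fold positive-class probabilities
# (e.g. cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1],
#  as in the hand-rolled stacking sketch above)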

oof = pd.DataFrame({
    "rf":  rf_oof_scores,
    "gb":  gb_oof_scores,
    "svc": svc_oof_scores,
    "lr":  lr_oof_scores,
})
print(oof.corr(method="pearson").round(3))

Read-out:

  • correlations > 0.95 across all pairs → diversity is missing; stacking will not help much
  • one correlation < 0.8 → that model carries a different error pattern; the meta can use it
  • correlations negative across a subset → almost certainly a bug in the pipelines, not real diversity

A common move in competitions: keep the model with the best standalone score, then drop otherwise-strong models that correlate above 0.97 with it, in favor of weaker but less correlated signals. The meta wins more from diversity than from average strength.
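
A sketch of that pruning step on the oof frame above, with "gb" standing in as the hypothetical best standalone model and the 0.97 threshold taken from the rule of thumb:

best = "gb"                                    # hypothetical best standalone model
corr_with_best = oof.corr().loc[best].drop(best)
keep = [best] + corr_with_best[corr_with_best < 0.97].index.tolist()
print("worth keeping for the meta:", keep)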

Feature Importance — Tool, Not Ground Truth

Ensemble methods provide built-in feature importance:

importances = rf.feature_importances_
sorted_idx = importances.argsort()[::-1]
for i in sorted_idx[:10]:
    print(f"  {feature_names[i]:>25}: {importances[i]:.4f}")

Treat the output as a debugging tool, not as causal evidence. Tree-based importance is biased toward high-cardinality features and can swap heavily across reruns. The honest companion is permutation importance on a held-out set: shuffle a feature, rerun inference, measure the score drop.

from sklearn.inspection import permutation_importance
r = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0, n_jobs=-1)
order = r.importances_mean.argsort()[::-1]
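# report the mean validation-score drop per shuffled feature, largest first
# (feature_names as assumed in the tree-importance snippet above)
for i in order[:10]:
    print(f"  {feature_names[i]:>25}: {r.importances_mean[i]:.4f} +/- {r.importances_std[i]:.4f}")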

If tree importance and permutation importance disagree sharply for a feature, trust permutation.

What To Inspect

  • individual-learner scores — they should be comparable, not identical; one much weaker learner can drag the ensemble down in voting and help in stacking
  • out-of-fold correlations — the diversity test above
  • learning curves per base learner — are the weaker ones plateauing for bias or variance reasons? That shapes the next move
  • number of estimators — both bagging and boosting have a point where more hurts; for bagging it plateaus, for boosting it overfits
  • calibration of the ensemble output — bagged trees tend to be overconfident; boosted trees can be systematically miscalibrated at the tails. See Calibration and Thresholds
  • training-vs-validation gap — boosting with a big learning_rate and many rounds reliably overfits; the staged-loss sketch after this list shows the gap opening round by round
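
A minimal way to watch that gap per boosting round, assuming the same split as the earlier snippets and using GradientBoostingClassifier, which exposes per-round staged_predict_proba (the hyperparameters here are illustrative):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.3, random_state=0)
gbm.fit(X_train, y_train)

# per-round log loss on train vs. validation; the round where the gap starts widening
# is the overfit point that early stopping would have caught
for n, (p_tr, p_va) in enumerate(zip(gbm.staged_predict_proba(X_train),
                                     gbm.staged_predict_proba(X_valid)), start=1):
    if n % 50 == 0:
        print(f"round {n:4d}  train {log_loss(y_train, p_tr):.3f}  "
              f"valid {log_loss(y_valid, p_va):.3f}")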

Failure Pattern

The canonical bagging failure is using a random forest with unlimited depth on a small dataset, trusting the training score, and being confused when validation collapses. The fix is min_samples_leaf > 1 and honest cross-validation.

The canonical boosting failure is too many rounds without early stopping. The training loss keeps dropping; the validation loss starts rising unnoticed. The fix is always: early stopping on a held-out fold.

The canonical stacking failure is leakage in meta-feature construction — the meta-learner sees in-sample base predictions. The fix is cv=... in StackingClassifier, or manual K-fold OOF in hand-rolled stacks.

Common Mistakes

  • setting n_estimators too low for boosting and giving up before the curve has stabilized
  • not tuning max_depth, learning_rate, or min_samples_leaf for boosting
  • treating feature importance as definitive without the permutation check
  • assuming ensembles always beat simple models — on tiny data, they do not
  • stacking without honest out-of-fold predictions (leakage masquerading as gain)
  • combining models whose OOF correlation is > 0.97 — pure cost, no benefit
  • comparing ensemble validation scores across runs with different CV seeds — a seed effect can look like an ensemble effect

Decision: Which Ensemble Strategy

| Situation | Start with | Escalate to |
| --- | --- | --- |
| tabular, low bias, high variance | random forest | boosted trees if diversity helps |
| tabular, high bias, plenty of data | LightGBM / XGBoost / CatBoost | tuned version with early stopping |
| tabular, several diverse signals (tree, linear, kNN) | soft-voting ensemble | stacking with LR meta |
| tiny or noisy data | single well-regularized model | bagging of that model |
| competition, long time budget | stacking with diverse base learners + honest OOF | bigger and more diverse base set |

A practical rule: random forest is the "know-nothing" tabular baseline; boosted trees are the default strong tabular model; stacking is the endgame when the easy gains are gone.

Practice

  1. Compare a single decision tree against a random forest on the same split. Report both the mean CV score and the band. Which model's gain is larger than the band?
  2. Train three base models (tree, SVC, logistic regression) on a dataset. Compute the out-of-fold prediction correlation matrix. Identify which pair carries the most diversity.
  3. Build a stacking classifier with cv=5 and compare it to the best single model. Then hand-roll a stack whose meta-learner trains on in-sample base predictions (no out-of-fold step) and show the leakage by comparing training vs. validation scores.
  4. Train a boosted model with learning_rate=0.3 and 1000 rounds, no early stopping. Plot training vs. validation loss. Identify the overfit point.
  5. Compute feature-importance and permutation-importance rankings. Name the feature where they disagree most and explain why.
  6. On a small dataset, compare a well-regularized logistic regression against an ensemble. Decide which wins and whether the win is worth the complexity.

Runnable Example

Longer Connection

Ensemble methods sit next to:

The ensemble is not the point. The decision — bagging for variance, boosting for bias, stacking for diverse signal — is the point. If the student cannot state which weakness the ensemble is targeting, they do not yet need the ensemble.