Clinic 17

Kernel Choice

Linear, RBF, polynomial, or a handcrafted kernel — on a small tabular binary problem. The RBF result looks like a clear winner until you read how the split was made. Pick a kernel and defend it under the trap.

Situation

Four Kernels, One Trap

RBF's leaderboard number is two points above linear. The split is random over a dataset with heavy patient duplication across rows.

Your Job

Pick A Kernel

Choose linear, RBF, polynomial, or a domain kernel. Name what you would fix about the split before you trust the ranking.

Bad Habit To Avoid

Following The Scoreboard

A kernel that "wins" on a leaky split is information about the split, not about the kernel.

Situation

You are building a binary classifier for a medical tabular problem — 30 continuous features describing a patient visit, label is whether the visit resulted in a follow-up within 14 days.

Data and pipeline:

  • 12,000 rows labeled. The rows are visits, not patients: a single patient can appear 3–8 times.
  • the feature matrix is scaled once at the top and then split 80/20 at random with `random_state=0`
  • metric is ROC-AUC plus recall at a 20% review budget

Four SVM variants sit on the leaderboard:

| variant | test AUC | fit time | notes |
| --- | --- | --- | --- |
| Linear (`LinearSVC` in a scaled pipeline) | 0.812 | 8 s | baseline |
| RBF (`SVC(kernel="rbf", C=1, gamma="scale")`) | 0.834 | 2.5 min | one-line switch, looks like a free win |
| Polynomial (`SVC(kernel="poly", degree=3, coef0=1)`) | 0.821 | 4 min | sensitive to degree and coef0 |
| Domain kernel (custom: scaled linear plus interaction terms for 3 flagged feature pairs) | 0.818 | 20 s | explainable, small lift |
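The first three variants are standard scikit-learn estimators. A minimal sketch of how they could be set up (the pipeline structure is an assumption; the domain kernel is hand-built per problem and not reproduced here):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

# Three of the four leaderboard variants, each with its scaler inside
# the pipeline so cross-validation fits the scaler on train folds only.
variants = {
    "linear": make_pipeline(StandardScaler(), LinearSVC(C=1.0)),
    "rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1, gamma="scale")),
    "poly": make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, coef0=1)),
}
```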

A teammate has already written the PR "switch to RBF" and added a doc comment: "RBF is the new default — 2.2 points of AUC for free."

Two things about the setup should worry you:

  • the 80/20 random split puts many of the same patients in both train and test. A patient seen on Monday and again on Friday can bleed across the split.
  • the feature scaling is fitted on the full dataset, not inside a pipeline, so the test split has seen the train-set statistics
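The first worry is easy to quantify. A quick leak check (a sketch; the `patient_id` array here is a toy stand-in, since the real one is not shown) measures how many test-set visits belong to patients the model has already seen in train:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real groups: ~4 visits per patient on average.
rng = np.random.default_rng(0)
patient_id = rng.integers(0, 3000, size=12000)

idx = np.arange(12000)
train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=0)

# Fraction of test visits whose patient also appears in the train split.
overlap = np.isin(patient_id[test_idx], patient_id[train_idx]).mean()
print(f"{overlap:.0%} of test visits belong to patients seen in train")
```

With 3–8 visits per patient, a random row-level split puts the overwhelming majority of test visits behind patients the model trained on.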

Artifact Packet

Kernel trade-offs on this size of problem:

| kernel | when it works | when it breaks | tuning surface |
| --- | --- | --- | --- |
| Linear | truly linear boundary; many features, modest nonlinearity | nonlinear class geometry | C only |
| RBF | smooth nonlinear decision boundary | leaks through gamma if unscaled; sensitive to bandwidth | C and gamma |
| Polynomial | problems with explicit interaction structure | high variance at high degree; scaling matters | C, degree, coef0 |
| Domain | when you know an interaction pattern (e.g., BMI × age) | requires hand-building per-problem | usually only C |
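One way to hand-build a domain kernel of this shape is a linear kernel over an augmented feature space, passed to `SVC` as a callable. This is a sketch; the flagged feature-pair indices are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical indices of the 3 flagged feature pairs.
FLAGGED_PAIRS = [(0, 5), (2, 9), (4, 12)]

def with_interactions(X):
    # Append one product column per flagged pair.
    inter = np.column_stack([X[:, i] * X[:, j] for i, j in FLAGGED_PAIRS])
    return np.hstack([X, inter])

def domain_kernel(X, Y):
    # Linear kernel on the augmented space: Gram matrix of shape (len(X), len(Y)).
    return with_interactions(X) @ with_interactions(Y).T

clf = SVC(kernel=domain_kernel, C=1.0)
```

Because the interaction terms are named columns, the lift this kernel buys is directly attributable to specific feature pairs, which is what makes it auditable.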

The trap to inspect:

  • leaky split: random row-level split on a dataset with repeated patients. The model memorizes a patient on Monday and is tested on the same patient's Friday visit. RBF has enough capacity to exploit this; linear has less.
  • scaler leak: fitting StandardScaler on the full matrix before splitting means the test rows have already informed the train-set feature mean and variance. Not huge on this problem, but real.
  • tuning budget: RBF has two knobs; linear has one. On a leaky split, RBF wins partly because it has more places to overfit the leak.

Correct split is by patient: GroupKFold with groups=patient_id. Expected AUC under that split is typically 1–3 points lower across all kernels, and the gap between RBF and linear often collapses.
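The grouped rerun is a few lines once `patient_id` is available. A sketch with toy data standing in for the real matrix (the 4-visits-per-patient grouping is an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in: 300 patients, 4 visits each.
X, y = make_classification(n_samples=1200, n_features=30, random_state=0)
patient_id = np.repeat(np.arange(300), 4)

# Scaler lives inside the pipeline, so each fold fits it on train-only rows.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1, gamma="scale"))
scores = cross_val_score(pipe, X, y, groups=patient_id,
                         cv=GroupKFold(n_splits=5), scoring="roc_auc")
print(f"AUC {scores.mean():.3f} ± {scores.std():.3f}")
```

No fold ever contains a patient that also appears in its training folds, so the "same patient on Monday and Friday" shortcut is closed off.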

Decision Prompt

Write a six-sentence defense that answers:

  1. Which kernel do you pick, and under which split?
  2. What do you expect the RBF–linear gap to look like once the split is by patient?
  3. What is the single inspection that proves the current leaderboard is about the split, not the kernel?
  4. Before tuning gamma, what should be fixed in the pipeline?
  5. If the RBF lift survives the patient-grouped split, what does that tell you about the class geometry?
  6. When would the domain kernel be the right pick even if it is not the top of the leaderboard?

Strong Reasoning Looks Like

  • fixes the split first: move to GroupKFold(groups=patient_id) before any kernel decision; a leaky split makes every kernel ranking unreliable
  • fixes the scaler next: the StandardScaler must live inside the Pipeline so each fold fits its own scaler on train-only rows
  • picks linear as the honest baseline, only escalating to RBF if the patient-grouped AUC gap is ≥ 1 point with stable CV bands
  • names the inspection: rerun all four kernels under GroupKFold(n_splits=5) and compare the AUC bands; if the RBF band and the linear band overlap, RBF is not a real win
  • tunes C before gamma; only one knob at a time, and always inside cross-validation
  • treats the domain kernel as the right pick when the lift is explainable and the deployment context (medical audit) rewards explainability over a small AUC gain
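The "one knob at a time, inside cross-validation" step above can be sketched with a single-parameter grid under the grouped split (toy data and grouping are assumptions, as before):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy stand-in: 300 patients, 4 visits each.
X, y = make_classification(n_samples=1200, n_features=30, random_state=0)
patient_id = np.repeat(np.arange(300), 4)

pipe = make_pipeline(StandardScaler(), LinearSVC())
grid = GridSearchCV(
    pipe,
    param_grid={"linearsvc__C": [0.01, 0.1, 1.0, 10.0]},  # one knob only
    cv=GroupKFold(n_splits=5),
    scoring="roc_auc",
)
grid.fit(X, y, groups=patient_id)
print(grid.best_params_)
```

Only after `C` is settled on both linear and RBF, and only if the RBF lift survives the grouped split, does a `gamma` grid earn its place.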

Common Wrong Moves

  • merging the RBF PR because "2.2 points of AUC is a lot" — on a leaky split, it is not
  • re-tuning gamma before fixing the patient leak; any gamma that wins on a leaky split is a measurement of the leak
  • keeping the StandardScaler outside the pipeline — even a 0.3-point AUC lift from scaler leakage becomes the deciding factor at this margin
  • picking RBF because "nonlinear is more powerful"; capacity without generalization is a cost, not a feature
  • dropping the domain kernel because it lost by 0.4 points; an explainable model with a 0.4-point tax is often the right ship in a medical context
  • comparing fit times as a tiebreaker while ignoring the split — a leaderboard built on a bad split makes every other dimension secondary

Run The Clinic In Browser

Use the runner to rerun the four kernels under a patient-grouped split and watch the AUC bands collapse together.

Reference Reveal

Open only after you write the defense.

The reference call is **fix the split before choosing the kernel**. Once `GroupKFold(groups=patient_id)` is in place with the `StandardScaler` inside the pipeline, the expected outcome is:

| kernel | random-split AUC (leaky) | patient-grouped AUC (honest) | gap |
| --- | --- | --- | --- |
| Linear | 0.812 | **0.785 ± 0.012** | — |
| RBF | 0.834 | 0.791 ± 0.015 | +0.006 (inside the band) |
| Polynomial | 0.821 | 0.778 ± 0.020 | −0.007 |
| Domain | 0.818 | 0.783 ± 0.010 | −0.002 |

In this honest view the RBF lift is **inside the CV band**. RBF's apparent 2.2-point win was mostly exploitation of the patient leak. The right ship for this medical context is **linear SVM** or the **domain kernel**:

  • linear has the smallest CV band and the clearest interpretation — coefficients map to feature importance
  • the domain kernel is ~1 point behind RBF in the leaky view but statistically tied once the leak is fixed, and it is auditable (the interaction terms are named)
  • RBF is defensible *only* if a follow-up run shows its patient-grouped AUC consistently ≥ 1 point above linear across 3+ seeds — and even then the audit story must account for the kernel's opacity

Why this matters beyond SVMs: the same pattern — more capacity + leaky split → apparent win — repeats on every model family. The kernel-choice question is really a split-choice question in disguise.
Schedule:

  • day 1: fix the split and scaler; rerun all four kernels under `GroupKFold(5)`
  • day 2: tune `C` on linear and on RBF (only `C` first, one grid each)
  • day 3: if the RBF lift is real under the honest split, then tune `gamma`; otherwise ship linear
  • day 4: evaluate the ranked short-list on the 20%-review-budget recall metric, not AUC alone

Abandon the kernel escalation if:

  • the patient-grouped linear AUC sits inside ± 0.015 of RBF — the free upgrade is not free
  • recall-at-20%-budget on RBF is lower than linear even when AUC is higher — a ranking that reorders the middle of the list is not useful
  • the audit team cannot explain what the RBF decision means on a specific patient — in medical deployments this is often a hard constraint

The practical lesson: **a kernel is a statement about the class geometry, not a performance knob**. When two kernels differ by less than the CV band, the simpler one wins by default. And when a split is leaky, no kernel ranking is trustworthy.

What To Do Next

  1. open SVM Margins and Kernels for the knob-by-knob tuning logic
  2. open Honest Splits and Baselines for the grouped-split recipe
  3. open Leakage Patterns for the scaler-leak pattern this clinic relies on
  4. rerun your own leaderboard under the correct group-aware split; if your top-of-board kernel survives, the win is real