Clinic 17

Kernel Choice

Linear, RBF, polynomial, or a handcrafted kernel — on a small tabular binary problem. The RBF result looks like a clear winner until you read how the split was made. Pick a kernel and defend it under the trap.

Situation

Four Kernels, One Trap

RBF's leaderboard number is two points above linear. The split is random over a dataset with heavy patient duplication across rows.

Your Job

Pick A Kernel

Choose linear, RBF, polynomial, or a domain kernel. Name what you would fix about the split before you trust the ranking.

Bad Habit To Avoid

Following The Scoreboard

A kernel that "wins" on a leaky split is information about the split, not about the kernel.

Situation

You are building a binary classifier for a medical tabular problem — 30 continuous features describing a patient visit, label is whether the visit resulted in a follow-up within 14 days.

Data and pipeline:

  • 12,000 rows labeled. The rows are visits, not patients: a single patient can appear 3–8 times.
  • the feature matrix is scaled once at the top and then split 80/20 at random with `random_state=0`
  • metric is ROC-AUC plus recall at a 20% review budget

Four SVM variants sit on the leaderboard:

| variant | test AUC | fit time | notes |
| --- | --- | --- | --- |
| Linear (`LinearSVC` in a scaled pipeline) | 0.812 | 8 s | baseline |
| RBF (`SVC(kernel="rbf", C=1, gamma="scale")`) | 0.834 | 2.5 min | one-line switch, looks like a free win |
| Polynomial (`SVC(kernel="poly", degree=3, coef0=1)`) | 0.821 | 4 min | sensitive to degree and coef0 |
| Domain kernel (custom: scaled linear plus interaction terms for 3 flagged feature pairs) | 0.818 | 20 s | explainable, small lift |
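The first three variants are standard scikit-learn estimators. A minimal sketch of how they could be set up (the pipeline structure is an assumption; the domain kernel is hand-built per problem and not reproduced here):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

# Three of the four leaderboard variants, each with its scaler inside
# the pipeline so cross-validation fits the scaler on train folds only.
variants = {
    "linear": make_pipeline(StandardScaler(), LinearSVC(C=1.0)),
    "rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1, gamma="scale")),
    "poly": make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, coef0=1)),
}
```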

A teammate has already written the PR "switch to RBF" and added a doc comment: "RBF is the new default — 2.2 points of AUC for free."

Two things about the setup should worry you:

  • the 80/20 random split puts many of the same patients in both train and test. A patient seen on Monday and again on Friday can bleed across the split.
  • the feature scaling is fitted on the full dataset, not inside a pipeline, so the test split has seen the train-set statistics
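The first worry is easy to quantify. A quick leak check (a sketch; the `patient_id` array here is a toy stand-in, since the real one is not shown) measures how many test-set visits belong to patients the model has already seen in train:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real groups: ~4 visits per patient on average.
rng = np.random.default_rng(0)
patient_id = rng.integers(0, 3000, size=12000)

idx = np.arange(12000)
train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=0)

# Fraction of test visits whose patient also appears in the train split.
overlap = np.isin(patient_id[test_idx], patient_id[train_idx]).mean()
print(f"{overlap:.0%} of test visits belong to patients seen in train")
```

With 3–8 visits per patient, a random row-level split puts the overwhelming majority of test visits behind patients the model trained on.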

Artifact Packet

Kernel trade-offs on this size of problem:

| kernel | when it works | when it breaks | tuning surface |
| --- | --- | --- | --- |
| Linear | truly linear boundary; many features, modest nonlinearity | nonlinear class geometry | C only |
| RBF | smooth nonlinear decision boundary | leaks through gamma if unscaled; sensitive to bandwidth | C and gamma |
| Polynomial | problems with explicit interaction structure | high variance at high degree; scaling matters | C, degree, coef0 |
| Domain | when you know an interaction pattern (e.g., BMI × age) | requires hand-building per-problem | usually only C |
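One way to hand-build a domain kernel of this shape is a linear kernel over an augmented feature space, passed to `SVC` as a callable. This is a sketch; the flagged feature-pair indices are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical indices of the 3 flagged feature pairs.
FLAGGED_PAIRS = [(0, 5), (2, 9), (4, 12)]

def with_interactions(X):
    # Append one product column per flagged pair.
    inter = np.column_stack([X[:, i] * X[:, j] for i, j in FLAGGED_PAIRS])
    return np.hstack([X, inter])

def domain_kernel(X, Y):
    # Linear kernel on the augmented space: Gram matrix of shape (len(X), len(Y)).
    return with_interactions(X) @ with_interactions(Y).T

clf = SVC(kernel=domain_kernel, C=1.0)
```

Because the interaction terms are named columns, the lift this kernel buys is directly attributable to specific feature pairs, which is what makes it auditable.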

The trap to inspect:

  • leaky split: random row-level split on a dataset with repeated patients. The model memorizes a patient on Monday and is tested on the same patient's Friday visit. RBF has enough capacity to exploit this; linear has less.
  • scaler leak: fitting StandardScaler on the full matrix before splitting means the test rows have already informed the train-set feature mean and variance. Not huge on this problem, but real.
  • tuning budget: RBF has two knobs; linear has one. On a leaky split, RBF wins partly because it has more places to overfit the leak.

Correct split is by patient: GroupKFold with groups=patient_id. Expected AUC under that split is typically 1–3 points lower across all kernels, and the gap between RBF and linear often collapses.
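The grouped rerun is a few lines once `patient_id` is available. A sketch with toy data standing in for the real matrix (the 4-visits-per-patient grouping is an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in: 300 patients, 4 visits each.
X, y = make_classification(n_samples=1200, n_features=30, random_state=0)
patient_id = np.repeat(np.arange(300), 4)

# Scaler lives inside the pipeline, so each fold fits it on train-only rows.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1, gamma="scale"))
scores = cross_val_score(pipe, X, y, groups=patient_id,
                         cv=GroupKFold(n_splits=5), scoring="roc_auc")
print(f"AUC {scores.mean():.3f} ± {scores.std():.3f}")
```

No fold ever contains a patient that also appears in its training folds, so the "same patient on Monday and Friday" shortcut is closed off.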

Decision Prompt

Write a six-sentence defense that answers:

  1. Which kernel do you pick, and under which split?
  2. What do you expect the RBF–linear gap to look like once the split is by patient?
  3. What is the single inspection that proves the current leaderboard is about the split, not the kernel?
  4. Before tuning gamma, what should be fixed in the pipeline?
  5. If the RBF lift survives the patient-grouped split, what does that tell you about the class geometry?
  6. When would the domain kernel be the right pick even if it is not the top of the leaderboard?

Strong Reasoning Looks Like

  • fixes the split first: move to GroupKFold(groups=patient_id) before any kernel decision; a leaky split makes every kernel ranking unreliable
  • fixes the scaler next: the StandardScaler must live inside the Pipeline so each fold fits its own scaler on train-only rows
  • picks linear as the honest baseline, only escalating to RBF if the patient-grouped AUC gap is ≥ 1 point with stable CV bands
  • names the inspection: rerun all four kernels under GroupKFold(n_splits=5) and compare the AUC bands; if the RBF band and the linear band overlap, RBF is not a real win
  • tunes C before gamma; only one knob at a time, and always inside cross-validation
  • treats the domain kernel as the right pick when the lift is explainable and the deployment context (medical audit) rewards explainability over a small AUC gain
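The "one knob at a time, inside cross-validation" step above can be sketched with a single-parameter grid under the grouped split (toy data and grouping are assumptions, as before):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy stand-in: 300 patients, 4 visits each.
X, y = make_classification(n_samples=1200, n_features=30, random_state=0)
patient_id = np.repeat(np.arange(300), 4)

pipe = make_pipeline(StandardScaler(), LinearSVC())
grid = GridSearchCV(
    pipe,
    param_grid={"linearsvc__C": [0.01, 0.1, 1.0, 10.0]},  # one knob only
    cv=GroupKFold(n_splits=5),
    scoring="roc_auc",
)
grid.fit(X, y, groups=patient_id)
print(grid.best_params_)
```

Only after `C` is settled on both linear and RBF, and only if the RBF lift survives the grouped split, does a `gamma` grid earn its place.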

Common Wrong Moves

  • merging the RBF PR because "2.2 points of AUC is a lot" — on a leaky split, it is not
  • re-tuning gamma before fixing the patient leak; any gamma that wins on a leaky split is a measurement of the leak
  • keeping the StandardScaler outside the pipeline — even a 0.3-point AUC lift from scaler leakage becomes the deciding factor at this margin
  • picking RBF because "nonlinear is more powerful"; capacity without generalization is a cost, not a feature
  • dropping the domain kernel because it lost by 0.4 points; an explainable model with a 0.4-point tax is often the right ship in a medical context
  • comparing fit times as a tiebreaker while ignoring the split — a leaderboard built on a bad split makes every other dimension secondary

Run The Clinic In Browser

Use the runner to rerun the four kernels under a patient-grouped split and watch the AUC bands collapse together.

Reference Reveal

Open only after you write the defense.

The reference call is **fix the split before choosing the kernel**. Once `GroupKFold(groups=patient_id)` is in place with the `StandardScaler` inside the pipeline, the expected outcome is:

| kernel | random-split AUC (leaky) | patient-grouped AUC (honest) | gap |
| --- | --- | --- | --- |
| Linear | 0.812 | **0.785 ± 0.012** | — |
| RBF | 0.834 | 0.791 ± 0.015 | +0.006 (inside the band) |
| Polynomial | 0.821 | 0.778 ± 0.020 | −0.007 |
| Domain | 0.818 | 0.783 ± 0.010 | −0.002 |

In this honest view the RBF lift is **inside the CV band**. RBF's apparent 2.2-point win was mostly exploitation of the patient leak. The right ship for this medical context is **linear SVM** or the **domain kernel**:

  • linear has the smallest CV band and the clearest interpretation — coefficients map to feature importance
  • the domain kernel is ~1 point behind RBF in the leaky view but statistically tied once the leak is fixed, and it is auditable (the interaction terms are named)
  • RBF is defensible *only* if a follow-up run shows its patient-grouped AUC consistently ≥ 1 point above linear across 3+ seeds — and even then the audit story must account for the kernel's opacity

Why this matters beyond SVMs: the same pattern — more capacity + leaky split → apparent win — repeats on every model family. The kernel-choice question is really a split-choice question in disguise.
Schedule:

  • day 1: fix the split and scaler; rerun all four kernels under `GroupKFold(5)`
  • day 2: tune `C` on linear and on RBF (only `C` first, one grid each)
  • day 3: if the RBF lift is real under the honest split, then tune `gamma`; otherwise ship linear
  • day 4: evaluate the ranked short-list on the 20%-review-budget recall metric, not AUC alone

Abandon the kernel escalation if:

  • the patient-grouped linear AUC sits inside ± 0.015 of RBF — the free upgrade is not free
  • recall-at-20%-budget on RBF is lower than linear even when AUC is higher — a ranking that reorders the middle of the list is not useful
  • the audit team cannot explain what the RBF decision means on a specific patient — in medical deployments this is often a hard constraint

The practical lesson: **a kernel is a statement about the class geometry, not a performance knob**. When two kernels differ by less than the CV band, the simpler one wins by default. And when a split is leaky, no kernel ranking is trustworthy.

What To Do Next

  1. open SVM Margins and Kernels for the knob-by-knob tuning logic
  2. open Honest Splits and Baselines for the grouped-split recipe
  3. open Leakage Patterns for the scaler-leak pattern this clinic relies on
  4. rerun your own leaderboard under the correct group-aware split; if your top-of-board kernel survives, the win is real