Clinic 17
Kernel Choice
Linear, RBF, polynomial, or a handcrafted kernel — on a small tabular binary problem. The RBF result looks like a clear winner until you read how the split was made. Pick a kernel and defend it under the trap.
Situation
Four Kernels, One Trap
RBF's leaderboard number is two points above linear. The split is random over a dataset with heavy patient duplication across rows.
Your Job
Pick A Kernel
Choose linear, RBF, polynomial, or a domain kernel. Name what you would fix about the split before you trust the ranking.
Bad Habit To Avoid
Following The Scoreboard
A kernel that "wins" on a leaky split is information about the split, not about the kernel.
Situation
You are building a binary classifier for a medical tabular problem — 30 continuous features describing a patient visit; the label is whether the visit resulted in a follow-up within 14 days.
Data and pipeline:
- 12,000 labeled rows. The rows are visits, not patients: a single patient can appear 3–8 times.
- the feature matrix is scaled once at the top and then split 80/20 at random with `random_state=0`
- the metric is ROC-AUC plus recall at a 20% review budget
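The second metric — recall at a 20% review budget — rewards a model that puts true follow-ups at the top of the list a reviewer can actually work through. A minimal stand-in implementation (the helper name and toy data are illustrative, not from the clinic's codebase):

```python
import numpy as np

def recall_at_budget(y_true, scores, budget=0.20):
    """Recall among the top-scoring fraction of cases a reviewer can look at."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)[::-1]        # highest predicted risk first
    k = int(np.ceil(budget * len(y_true)))  # size of the review list
    flagged = order[:k]
    return y_true[flagged].sum() / y_true.sum()

# Toy example: 10 visits, 3 true follow-ups, scores already sorted.
y = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
s = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])
print(recall_at_budget(y, s))  # top 2 of 10 reviewed: catches 1 of 3 positives
```

At a 20% budget only the top fifth of visits is reviewed, so a model can have a higher AUC yet worse recall here if its gains come from reordering the middle of the list.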
Four SVM variants sit on the leaderboard:
| variant | test AUC | fit time | notes |
|---|---|---|---|
| Linear (`LinearSVC` in a scaled pipeline) | 0.812 | 8 s | baseline |
| RBF (`SVC(kernel="rbf", C=1, gamma="scale")`) | 0.834 | 2.5 min | one-line switch, looks like a free win |
| Polynomial (`SVC(kernel="poly", degree=3, coef0=1)`) | 0.821 | 4 min | sensitive to degree and coef0 |
| Domain kernel (custom: scaled linear plus interaction terms for 3 flagged feature pairs) | 0.818 | 20 s | explainable, small lift |
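The domain-kernel row can be read as a concrete construction: a linear kernel on the scaled features plus hand-picked interaction terms. A sketch under assumptions — the pair indices below are placeholders, not the clinic's three flagged pairs:

```python
import numpy as np

# Hypothetical flagged pairs; the clinic's real pairs are not given.
FLAGGED_PAIRS = ((0, 1), (2, 5), (4, 7))

def domain_kernel(A, B, pairs=FLAGGED_PAIRS):
    """Linear kernel on the features, augmented with products of
    hand-flagged feature pairs (the 'interaction terms' above)."""
    def augment(X):
        inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
        return np.hstack([X, inter])
    return augment(A) @ augment(B).T

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 30))
K = domain_kernel(X, X)  # a valid Gram matrix: square and symmetric
```

Because `domain_kernel(A, B)` returns the Gram matrix, it can be passed directly as `SVC(kernel=domain_kernel)`. The kernel stays auditable because the interaction terms are named features, not an implicit feature space.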
A teammate has already written the PR "switch to RBF" and added a doc comment: "RBF is the new default — 2.2 points of AUC for free."
Two things about the setup should worry you:
- the 80/20 random split puts many of the same patients in both train and test. A patient seen on Monday and again on Friday can bleed across the split.
- the feature scaling is fitted on the full dataset, not inside a pipeline, so the test split has seen the train-set statistics
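The scaler worry has a one-line fix in scikit-learn. A minimal sketch, assuming a 30-column feature matrix (the data here is a synthetic stand-in):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# The scaler lives inside the pipeline, so any CV fold fits its own
# mean/variance on train-only rows instead of the full matrix.
leak_free = make_pipeline(StandardScaler(), LinearSVC(C=1.0))

# Synthetic stand-in for the 30-feature visit matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
leak_free.fit(X, y)
```

This removes the scaler leak; the split leak is a separate change and needs patient grouping.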
Artifact Packet
Kernel trade-offs on this size of problem:
| kernel | when it works | when it breaks | tuning surface |
|---|---|---|---|
| Linear | truly linear boundary; many features, modest nonlinearity | nonlinear class geometry | C only |
| RBF | smooth nonlinear decision boundary | leaks through gamma if unscaled; sensitive to bandwidth | C and gamma |
| Polynomial | problems with explicit interaction structure | high variance at high degree; scaling matters | C, degree, coef0 |
| Domain | when you know an interaction pattern (e.g., BMI × age) | requires hand-building per-problem | usually only C |
The trap to inspect:
- leaky split: random row-level split on a dataset with repeated patients. The model memorizes a patient on Monday and is tested on the same patient's Friday visit. RBF has enough capacity to exploit this; linear has less.
- scaler leak: fitting `StandardScaler` on the full matrix before splitting means the test rows have already informed the train-set feature mean and variance. Not huge on this problem, but real.
- tuning budget: RBF has two knobs; linear has one. On a leaky split, RBF wins partly because it has more places to overfit the leak.
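The patient-leak worry is checkable in a few lines. A sketch with synthetic IDs (`patient_id` here is a random stand-in, since the real identifiers are not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in: 12,000 visit rows tagged with ~2,500 patients, so most
# patients contribute several visits, as the clinic describes.
patient_id = rng.integers(0, 2500, size=12_000)

# Reproduce the leaky 80/20 row-level split.
idx = rng.permutation(len(patient_id))
train_idx, test_idx = idx[:9_600], idx[9_600:]

train_patients = set(patient_id[train_idx])
test_patients = set(patient_id[test_idx])

# The single inspection: what fraction of test patients also appear in train?
overlap = len(train_patients & test_patients) / len(test_patients)
print(f"{overlap:.0%} of test patients also appear in train")
```

If most test patients also appear in train, any row-level leaderboard number is measuring the leak at least as much as the kernel.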
The correct split is by patient: `GroupKFold` with `groups=patient_id`. Expected AUC under that split is typically 1–3 points lower across all kernels, and the gap between RBF and linear often collapses.
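The grouped rerun is mechanical once the pipeline is leak-free. A sketch on synthetic stand-in data (sizes and column choices are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 1,200 visits from 250 patients, 30 features.
rng = np.random.default_rng(0)
patient_id = rng.integers(0, 250, size=1_200)
X = rng.normal(size=(1_200, 30))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Patient-grouped CV keeps every visit of a patient on one side of the split.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=patient_id, cv=cv, scoring="roc_auc")
print(f"patient-grouped AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```

Running the same loop for all four kernels gives the honest AUC bands the clinic asks you to compare.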
Decision Prompt
Write a six-sentence defense that answers:
- Which kernel do you pick, and under which split?
- What do you expect the RBF–linear gap to look like once the split is by patient?
- What is the single inspection that proves the current leaderboard is about the split, not the kernel?
- Before tuning `gamma`, what should be fixed in the pipeline?
- If the RBF lift survives the patient-grouped split, what does that tell you about the class geometry?
- When would the domain kernel be the right pick even if it is not the top of the leaderboard?
Strong Reasoning Looks Like
- fixes the split first: move to `GroupKFold(groups=patient_id)` before any kernel decision; a leaky split makes every kernel ranking unreliable
- fixes the scaler next: the `StandardScaler` must live inside the `Pipeline` so each fold fits its own scaler on train-only rows
- picks linear as the honest baseline, only escalating to RBF if the patient-grouped AUC gap is ≥ 1 point with stable CV bands
- names the inspection: rerun all four kernels under `GroupKFold(n_splits=5)` and compare the AUC bands; if the RBF band and the linear band overlap, RBF is not a real win
- tunes `C` before `gamma`; only one knob at a time, and always inside cross-validation
- treats the domain kernel as the right pick when the lift is explainable and the deployment context (medical audit) rewards explainability over a small AUC gain
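The "one knob at a time, inside cross-validation" habit might look like the following sketch (synthetic stand-in data; the grid values are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in: 600 visits from 120 patients.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 30))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
patient_id = rng.integers(0, 120, size=600)

# Tune only C, under the patient-grouped split, with the scaler in-pipeline.
pipe = Pipeline([("scale", StandardScaler()), ("svm", LinearSVC())])
grid = GridSearchCV(
    pipe,
    {"svm__C": [0.01, 0.1, 1, 10]},
    cv=GroupKFold(n_splits=5),
    scoring="roc_auc",
)
grid.fit(X, y, groups=patient_id)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern applies to RBF: a grid over `svm__C` first, and a `gamma` grid only if the honest lift justifies it.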
Common Wrong Moves
- merging the RBF PR because "2.2 points of AUC is a lot" — on a leaky split, it is not
- re-tuning `gamma` before fixing the patient leak; any `gamma` that wins on a leaky split is a measurement of the leak
- keeping the `StandardScaler` outside the pipeline — even a 0.3-point AUC lift from scaler leakage becomes the deciding factor at this margin
- picking RBF because "nonlinear is more powerful"; capacity without generalization is a cost, not a feature
- dropping the domain kernel because it lost by 0.4 points; an explainable model with a 0.4-point tax is often the right ship in a medical context
- comparing fit times as a tiebreaker while ignoring the split — a leaderboard built on a bad split makes every other dimension secondary
Run The Clinic In Browser
Use the runner to rerun the four kernels under a patient-grouped split and watch the AUC bands collapse together.
Reference Reveal
Open only after you write the defense.
The reference call is **fix the split before choosing the kernel**. Once `GroupKFold(groups=patient_id)` is in place with the `StandardScaler` inside the pipeline, the expected outcome is:

| kernel | random-split AUC (leaky) | patient-grouped AUC (honest) | gap |
|---|---|---|---|
| Linear | 0.812 | **0.785 ± 0.012** | — |
| RBF | 0.834 | 0.791 ± 0.015 | +0.006 (inside the band) |
| Polynomial | 0.821 | 0.778 ± 0.020 | −0.007 |
| Domain | 0.818 | 0.783 ± 0.010 | −0.002 |

In this honest view the RBF lift is **inside the CV band**. RBF's apparent 2.2-point win was mostly exploitation of the patient leak. The right ship for this medical context is **linear SVM** or the **domain kernel**:

- linear has the smallest CV band and the clearest interpretation — coefficients map to feature importance
- the domain kernel is ~1 point behind RBF in the leaky view but statistically tied once the leak is fixed, and it is auditable (the interaction terms are named)
- RBF is defensible *only* if a follow-up run shows its patient-grouped AUC consistently ≥ 1 point above linear across 3+ seeds — and even then the audit story must account for the kernel's opacity

Why this matters beyond SVMs: the same pattern — more capacity + leaky split → apparent win — repeats on every model family. The kernel-choice question is really a split-choice question in disguise.
Schedule:

- day 1: fix the split and scaler; rerun all four kernels under `GroupKFold(5)`
- day 2: tune `C` on linear and on RBF (only `C` first, one grid each)
- day 3: if the RBF lift is real under the honest split, then tune `gamma`; otherwise ship linear
- day 4: evaluate the ranked short-list on the 20%-review-budget recall metric, not AUC alone

Abandon the kernel escalation if:

- the patient-grouped linear AUC sits inside ± 0.015 of RBF — the free upgrade is not free
- recall-at-20%-budget on RBF is lower than linear even when AUC is higher — a ranking that reorders the middle of the list is not useful
- the audit team cannot explain what the RBF decision means on a specific patient — in medical deployments this is often a hard constraint

The practical lesson: **a kernel is a statement about the class geometry, not a performance knob**. When two kernels differ by less than the CV band, the simpler one wins by default. And when a split is leaky, no kernel ranking is trustworthy.

What To Do Next
- open SVM Margins and Kernels for the knob-by-knob tuning logic
- open Honest Splits and Baselines for the grouped-split recipe
- open Leakage Patterns for the scaler-leak pattern this clinic relies on
- rerun your own leaderboard under the correct group-aware split; if your top-of-board kernel survives, the win is real