Clinic 21

Embedding Reuse Or Retrain

A frozen off-the-shelf embedding gives you 0.78 F1 in an hour. A domain-refit embedding might give you 0.84 in a week — or 0.73 if the data is too small. Pick one, and pick the inspection that proves you right.

Situation

Two Embeddings, One Budget

A frozen pretrained encoder is fast and decent. A domain-refit encoder is slow and might be worse. The team wants a commit by Thursday.

Your Job

Pick Reuse Or Retrain

Defend one under the compute and data size you actually have. Name the inspection that would flip the decision.

Bad Habit To Avoid

Refitting Because You Can

Refitting an encoder on 5k in-domain samples often destroys a pretrained representation learned on billions.

Situation

You are building a classifier for biomedical abstract triage — given a 200-word abstract, predict whether it is relevant to a given rare-disease review.

Resources:

  • labeled data: 5,200 (abstract, label) pairs, human-annotated for one narrow rare disease
  • unlabeled in-domain data: 80,000 biomedical abstracts (PubMed scrape, 6 months)
  • compute: one A100, 4 days
  • off-the-shelf encoder: a sentence-transformers model pretrained on general web + scientific text, 384-dim output
  • candidate domain encoder: the same model fine-tuned with MLM on the 80k unlabeled abstracts, then frozen and used as a feature extractor
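The core data transformation behind the MLM refit can be sketched without any framework: mask roughly 15% of token positions as prediction targets, applying the BERT-style 80/10/10 rule (80% become a mask token, 10% a random token, 10% stay unchanged). This is a minimal stand-in, not the actual refit recipe; the function name, token ids, and mask id are all illustrative, and `-100` follows the common convention for positions ignored by the loss.

```python
import numpy as np

def mask_for_mlm(token_ids, mask_id, vocab_size, rng, p=0.15):
    """BERT-style MLM masking: ~15% of positions become prediction targets;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    token_ids = np.array(token_ids)                 # work on a copy
    labels = np.full_like(token_ids, -100)          # -100 = ignored by the loss
    targets = rng.random(token_ids.shape) < p       # choose prediction targets
    labels[targets] = token_ids[targets]            # loss is computed only here
    roll = rng.random(token_ids.shape)
    token_ids[targets & (roll < 0.8)] = mask_id     # 80%: replace with [MASK]
    rand = targets & (roll >= 0.8) & (roll < 0.9)   # 10%: replace with random
    token_ids[rand] = rng.integers(0, vocab_size, rand.sum())
    return token_ids, labels                        # remaining 10%: unchanged

rng = np.random.default_rng(0)
ids = rng.integers(5, 1000, size=512)               # toy token-id sequence
masked, labels = mask_for_mlm(ids, mask_id=4, vocab_size=1000, rng=rng)
print("prediction targets:", (labels != -100).sum(), "of", ids.size)
```

The refit itself would then train the encoder to predict `labels` from `masked` over the 80k abstracts before freezing it.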

Results on the first pass:

| approach | labeled data used | training time | held-out F1 | expected variance |
|---|---|---|---|---|
| Frozen OTS + linear head | 5.2k labels | ~1 hr | 0.78 | ± 0.01 |
| Frozen domain-refit + linear head | 80k MLM + 5.2k labels | ~36 hr | 0.82 (one run) | unknown — only one run |
| End-to-end fine-tune OTS on labels | 5.2k labels | ~4 hr | 0.73 | ± 0.04 (overfits) |
| End-to-end fine-tune domain-refit on labels | 80k MLM + 5.2k labels | ~40 hr | 0.79 (one run) | unknown — only one run |

One week to commit. The team lead wants the best F1 that a review coordinator (not an engineer) will rerun next quarter.
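The "rerun by a coordinator" requirement is concrete with frozen features: retraining is one fit call on precomputed embeddings. A minimal sketch, with synthetic 384-dim vectors standing in for the frozen encoder's outputs (in the real pipeline these would come from the encoder's encode call on the abstracts; all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for frozen-encoder outputs: 5,200 abstracts x 384 dims.
n, dim = 5200, 384
X = rng.normal(size=(n, dim))
w_true = rng.normal(size=dim)                       # synthetic label signal
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The entire quarterly "retrain": one fit call on new labels, no GPU.
head = LogisticRegression(max_iter=1000)
head.fit(X_tr, y_tr)
f1 = f1_score(y_te, head.predict(X_te))
print(f"held-out F1: {f1:.2f}")
```

Everything that makes this rerunnable by a non-engineer lives in the frozen encoder: the only trainable object is a linear head.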

Artifact Packet

Three axes to reason along:

  • labeled data volume — 5.2k is modest; full fine-tune risks overfitting
  • unlabeled in-domain volume — 80k abstracts is meaningful; enough for MLM refit to shift the distribution
  • deployment owner — a review coordinator will maintain this, which penalizes any approach that needs a deep-learning retraining loop to rerun

Patterns from the literature and from operational experience:

| regime | typical best move | why |
|---|---|---|
| labels < 1k | freeze, use OTS | too little signal to adapt the encoder |
| labels 1–10k + in-domain unlabeled | MLM-refit, then freeze | shifts the representation toward the domain without overfitting to labels |
| labels 10k–100k | fine-tune a small adapter | adapter layers can specialize without wiping the base |
| labels > 100k + compute | full end-to-end fine-tune | enough signal to justify the capacity |
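The regime table above can be codified as a small decision function. The thresholds are the rough bands from the table, not hard rules, and the function name is illustrative:

```python
def transfer_move(n_labels: int, has_indomain_unlabeled: bool,
                  has_compute: bool) -> str:
    """Map the label/unlabeled/compute regime to the typical best move."""
    if n_labels < 1_000:
        return "freeze, use OTS"                # too little signal to adapt
    if n_labels <= 10_000:
        if has_indomain_unlabeled:
            return "MLM-refit then freeze"      # shift representation, not head
        return "freeze, use OTS"
    if n_labels <= 100_000:
        return "fine-tune a small adapter"      # specialize without wiping base
    if has_compute:
        return "full fine-tune"                 # enough signal for the capacity
    return "fine-tune a small adapter"

print(transfer_move(5_200, True, True))  # the clinic's regime
```

For the clinic's 5.2k labels plus 80k unlabeled abstracts, this lands on "MLM-refit then freeze", consistent with the first-pass results.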

Two diagnostics that would settle the debate:

  • embedding linear-probe gap: train a linear head on OTS vs. domain-refit; the F1 gap (4 points here, 0.78 → 0.82) is the signal
  • domain-shift distance: KL or MMD between OTS embeddings of web data and of the 80k abstracts — if the shift is large, refit is earning its runtime
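The second diagnostic is cheap to compute directly. A minimal sketch of a biased RBF-kernel MMD² estimate in plain NumPy, on toy Gaussians standing in for the two embedding distributions (dimensions, sample sizes, and the fixed bandwidth are illustrative; a median-heuristic bandwidth would be the usual refinement):

```python
import numpy as np

def rbf_mmd2(X, Y, sigma):
    """Biased estimate of squared MMD between samples X and Y, RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
dim = 16                                  # small stand-in for 384-dim embeddings
sigma = np.sqrt(dim)                      # crude bandwidth scaled with dimension
web = rng.normal(size=(200, dim))         # OTS embeddings of general web text
same = rng.normal(size=(200, dim))        # same distribution -> MMD^2 near 0
shifted = rng.normal(loc=0.5, size=(200, dim))  # domain-shifted "abstracts"

print(f"same-domain MMD^2:    {rbf_mmd2(web, same, sigma):.4f}")
print(f"shifted-domain MMD^2: {rbf_mmd2(web, shifted, sigma):.4f}")
```

A large MMD² between web-text and abstract embeddings says the refit has a real distribution to move toward; a near-zero one says the 36 hours of refit compute is unlikely to pay off.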

Decision Prompt

Write a six-sentence defense that answers:

  1. Which approach do you commit to — reusing the frozen OTS encoder, or the domain MLM-refit?
  2. Would you run end-to-end fine-tuning at all, given 5.2k labels?
  3. What single inspection separates "real domain gain" from "random noise from one run"?
  4. What would make you drop the MLM-refit plan?
  5. How does "the review coordinator reruns this next quarter" affect the pick?
  6. What is the minimum labeled-data count at which you would switch to full end-to-end fine-tuning?

Strong Reasoning Looks Like

  • picks domain MLM-refit + frozen features + linear head — the 4-point F1 lift is large, the pipeline remains simple, and the frozen feature extractor keeps deployment boring
  • explicitly rejects end-to-end fine-tune on 5.2k labels: both OTS end-to-end (0.73) and domain-refit end-to-end (0.79) show lower F1 than their frozen counterparts, which is the signature of label overfitting at this scale
  • demands 3 seeds for the domain-refit MLM before committing; one run at F1 0.82 is a draw from a distribution, not a measurement. The 3-seed mean is the number to ship on.
  • uses the linear-probe gap as the inspection: if 3-seed mean of domain-refit > OTS by > 2 points with non-overlapping bands, refit is earning its compute
  • accounts for the review coordinator: a frozen encoder + linear head is retrainable in one script call, which matters more than a 1-point F1 lift from a fragile end-to-end pipeline
  • treats "full end-to-end fine-tune" as the right pick only once labels are in the 10k–20k range and the labeling team can keep delivering

Common Wrong Moves

  • declaring the MLM-refit "the winner" from one run at 0.82 — the variance estimate is the first requirement, not the result
  • running end-to-end fine-tune because "fine-tuning is the serious answer" — on 5.2k labels, it overfits and regresses on this problem (0.73 vs 0.78 for OTS frozen)
  • skipping the MLM-refit because "it's slow" — 36 hours of compute is within the 4-day budget and the F1 lift is visible
  • deploying a pipeline the review coordinator cannot rerun — 6 months from now the model is stale and there is no one to retrain it
  • tuning the linear head on the held-out set to chase the last 0.5 F1 — that is exactly the evaluation pollution this clinic is trying to avoid
  • ignoring the domain-shift diagnostic — if OTS and refit embeddings are near-identical by MMD, the refit does not buy anything

Run The Clinic In Browser

Use the runner to linear-probe OTS vs. a sketch of domain-refit on a toy corpus and compute MMD between the two embedding distributions.

Reference Reveal

Open only after you write the defense.

The reference call is **domain MLM-refit, then freeze, then linear head on top — ship that**.

The decision ladder:

  1. **Labels 5.2k, unlabeled 80k, compute 4 days** — this is the regime where MLM-refit is explicitly designed to win: enough unlabeled data to move the representation, not enough labels to justify end-to-end fine-tuning.
  2. **Frozen features keep the deployment loop boring** — the review coordinator reruns `pipeline.fit(X, y)` with new labels; no GPU retraining is required quarterly.
  3. **End-to-end fine-tuning regresses at this data size** — both OTS and domain-refit F1 drop when the encoder is unfrozen on 5.2k labels; that is the textbook pattern of overfitting to a small label set.

Mandatory pre-commit inspections:

  • **3-seed domain-refit MLM**: rerun across 3 seeds and report F1 mean ± std; the 0.82 single-run number must survive as a 3-seed mean of ≥ 0.80
  • **linear-probe OTS vs. refit**: the gap must be ≥ 2 F1 points with non-overlapping bands; otherwise ship OTS and save the compute
  • **MMD between OTS and refit embeddings** on 1,000 held-out abstracts: a sanity check that the refit has actually moved the representation, not just the seed
  • **cost of retraining**: time the MLM-refit end-to-end; if it exceeds 48 hours, warn the review coordinator that the quarterly rerun is an engineer-in-the-loop task

Expected outcome:

  • OTS frozen: 0.78 ± 0.01 (single run, already stable)
  • domain-refit frozen: 0.81 ± 0.015 (mean of 3 seeds) — the honest estimate behind the 0.82 single run
  • the shipped system uses domain-refit frozen + linear head, ~36 hr of compute, retrainable by the review coordinator on a schedule

Switch decisions for next cycle:

  • labels grow to 15k+ → try small LoRA-style adapter layers unfrozen on the labels; a full freeze may no longer be optimal
  • an inspection shows the classifier misclassifies rare subtopics (e.g., pediatric variants of the rare disease) → the answer is more labels on those subtopics, not a bigger model
  • MMD shows OTS and refit embeddings are nearly identical → skip the refit next time; the 36 hr of compute was wasted and the 4-point gap was noise

Escalation if the 3-seed MLM-refit mean does *not* clear 0.80:

  • the unlabeled corpus may be too general — try a curated sub-corpus closer to the rare-disease literature
  • the MLM task may be too easy on this data — try a harder self-supervision objective (e.g., contrastive, SimCSE-style)

The practical lesson: **"retrain the encoder" is an attractive-sounding move that is wrong when labels are scarce**. The disciplined move is: freeze, refit on unlabeled data if the domain shift justifies it, measure the linear-probe gap, and ship the simpler pipeline if the gap is small. The encoder is rarely the first thing to change.

What To Do Next

  1. open Vision and Text Encoders for the freeze-vs-adapt ladder
  2. open Self-Supervised and Representation Learning — the MLM-refit recipe in detail
  3. open Freeze Or Fine-Tune? — the adjacent classical-transfer clinic
  4. measure the linear-probe gap and MMD on your own domain; if the gap is < 1 point, ship the frozen OTS and save the compute