Optimizers and Regularization¶
What This Is¶
This page is about one practical training question:
- which knob is the smallest fix that actually improves validation — optimizer, learning rate, weight decay, dropout, or none of the above
The trap this page is trying to prevent is adding regularization without knowing which problem you are solving. Overfitting, underfitting, and optimization instability all look different in the curves, and they call for different fixes. Reaching for dropout when the real problem is a too-high learning rate makes both worse.
The working skill: read the curve, name the problem, apply the smallest fix, and see whether validation actually moved.
When You Use It¶
- training curves look unstable, oscillating, or diverging
- the validation loss has stalled and training loss keeps dropping (overfitting)
- both losses are flat and bad (underfitting)
- gradient norms explode or vanish
- a fine-tune is drifting from the pretrained behavior faster than it should
- you need a defensible choice between AdamW and SGD for a new task
The Three Forces¶
Every training curve is driven by three forces, and most fixes belong to one of them:
| Force | Controls | Visible when | Typical fix |
|---|---|---|---|
| Optimization | how the weights move per step | instability, divergence, exploding gradients | LR, clipping, warmup, optimizer choice |
| Capacity | how much the model can memorize | overfitting vs underfitting gap | width, depth, weight decay, dropout |
| Data | what the model sees | everything upstream | better sampling, augmentation, cleaning |
Almost every debugging mistake comes from reaching into the wrong column.
Optimizer Choice¶
A short, opinionated ladder:
- AdamW — default for most deep learning. Adaptive per-parameter learning rates and decoupled weight decay.
- SGD with momentum — strong default for CNNs on large vision datasets; generalizes better than Adam on some tasks.
- Lion — more recent, sometimes beats AdamW on LLM pretraining. Same API, smaller memory footprint.
- RMSprop / Adagrad — legacy; rarely a good first pick today.
- Sophia, Muon, distributed Shampoo — cutting edge; expect some volatility and a larger tuning budget.
Starting points:
- LLMs / transformers: AdamW, LR 1e-4 to 5e-4, weight decay 0.1, β₂ = 0.95 (written out in the sketch after this list)
- vision from scratch: SGD with momentum 0.9, LR 0.1 with cosine schedule, weight decay 5e-4
- fine-tuning: AdamW, LR 1e-5 to 5e-5, weight decay 0.01, warmup 500–1000 steps
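The first bullet written out as code, as a reference point (the model here is a stand-in for whatever you are training; the hyperparameters are the listed starting values):

import torch
import torch.nn as nn

model = nn.Linear(128, 128)  # stand-in for your actual model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)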
AdamW vs Adam — Why The W Matters¶
Original Adam added weight decay into the gradient, which interacts badly with the adaptive per-parameter LR. AdamW decouples weight decay from the gradient step:
# AdamW-style update (simplified; bias correction omitted)
g_t = gradient
m_t = β1 * m_{t-1} + (1 - β1) * g_t
v_t = β2 * v_{t-1} + (1 - β2) * g_t^2
θ_t = θ_{t-1} - lr * (m_t / (sqrt(v_t) + ε) + wd * θ_{t-1})
                                               ^^^^^^^^^^^^
                                               decoupled weight decay
Use AdamW unless you have a specific reason to use plain Adam. In PyTorch, this is a one-line switch: torch.optim.AdamW(...).
Weight Decay vs L2 — Subtle But Important¶
People often say "L2 regularization" and "weight decay" interchangeably. They are only equivalent for plain SGD.
- L2 penalty: adds λ ||θ||² to the loss. Gradients include 2λ θ.
- Weight decay: multiplies θ by (1 - lr * λ) every step.
For adaptive optimizers (Adam, RMSprop), these two diverge. AdamW applies weight decay directly, which is almost always what you actually want.
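A minimal numeric sketch of the divergence, isolating the penalty term for a single parameter (values are illustrative, and the factor-of-2 convention varies):

lr, lam = 0.1, 0.01
theta, grad = 1.0, 0.0  # zero data gradient to isolate the regularizer

# L2 penalty: 2*lam*theta enters the gradient, so Adam would rescale it by 1/sqrt(v_t)
theta_l2 = theta - lr * (grad + 2 * lam * theta)

# decoupled weight decay: theta shrinks directly, untouched by the adaptive scaling
theta_wd = theta * (1 - lr * lam) - lr * grad

Under plain SGD the two coincide (up to the constant-factor convention); under Adam's per-parameter rescaling they do not, which is exactly the difference AdamW fixes.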
Practical tuning:
- weight decay 0.1 for LLM pretraining
- 0.05 for LLM fine-tune
- 0.01 for most vision fine-tunes
- 5e-4 for vision from scratch with SGD + momentum
Learning Rate Choice¶
The single most important hyperparameter. A wrong LR makes every other fix look ineffective.
LR finder¶
Run a short scan from 1e-7 up to 1.0 over a few hundred steps and plot loss vs LR. The sweet spot is the region just before the loss explodes.
# minimal LR finder sketch
# run this on a throwaway copy of the model; the scan wrecks the weights
import numpy as np
import matplotlib.pyplot as plt

lrs = np.logspace(-7, 0, num=200)
losses = []
for i, batch in enumerate(train_loader):
    if i >= len(lrs):
        break
    for g in optimizer.param_groups:
        g["lr"] = lrs[i]
    loss = step(batch)  # step(): forward, backward, optimizer.step(), returns loss
    losses.append(loss.item())
plt.semilogx(lrs, losses)
plt.show()
Pick an LR at or slightly below the steepest downward slope. This is more reliable than guessing.
Scale with batch size¶
Larger batch → larger LR, approximately linearly, up to a point. The "linear scaling rule" (Goyal et al.) works well up to batch sizes around 8k for ImageNet-scale training.
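The rule itself is one line of arithmetic (base values here are illustrative):

base_lr, base_bs = 0.1, 256
new_bs = 1024
new_lr = base_lr * new_bs / base_bs  # 0.4; linear scaling, valid only up to a point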
Gradient Clipping¶
A cheap safeguard that prevents rare huge gradients from destroying the model:
# after loss.backward() and before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
When to use it:
- transformers and RNNs, always — 1.0 is a strong default
- on any task where loss spikes show up in the training curve
- during LR warmup, when the model is most vulnerable
When not to use it:
- if your gradients are already well-behaved and clipping is not firing; it adds overhead for no benefit
- with very aggressive clipping (max_norm < 0.5) where you are hiding a real LR problem
Learning Rate Warmup¶
Start at a small LR and linearly increase to the target over the first N steps:
from torch.optim.lr_scheduler import LambdaLR

def warmup_linear(step, warmup_steps):
    return min(1.0, step / warmup_steps)

# call scheduler.step() once per optimizer step, not once per epoch
scheduler = LambdaLR(optimizer, lambda s: warmup_linear(s, warmup_steps=1000))
Warmup protects randomly-initialized or newly-fine-tuned weights from the first few huge updates. A good default for transformers is 500–2000 steps.
Gradient Accumulation¶
If your GPU cannot hold the batch size you need, accumulate gradients over several micro-batches:
accum = 4
optimizer.zero_grad()
for i, batch in enumerate(train_loader):
    loss = forward(batch) / accum  # divide so accumulated gradients average, not sum
    loss.backward()
    if (i + 1) % accum == 0:
        optimizer.step()
        optimizer.zero_grad()
This gives you 4× the effective batch size without the memory cost. It composes cleanly with AdamW and LR schedulers; with mixed precision, keep scaler.step and scaler.update on the same accumulation boundary as optimizer.step. For DDP, scale your LR by the effective global batch size, not the per-GPU batch size.
Exponential Moving Average (EMA)¶
Keep a second copy of the weights that is the exponential moving average of the training weights. Use it at inference for smoother, often better predictions:
from copy import deepcopy
import torch

class EMA:
    def __init__(self, model, decay=0.999):
        # frozen copy; note: buffers (e.g. BN running stats) are copied once, not averaged
        self.shadow = deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad = False
        self.decay = decay

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1 - self.decay)

# usage: ema = EMA(model); call ema.update(model) after each optimizer.step();
# evaluate with ema.shadow instead of model
Cheap, common in diffusion and semi-supervised training, usually adds 0.1–0.5% validation accuracy.
Regularization Tools¶
Each of these fixes a specific failure mode. Reach for them in this order:
Weight decay¶
The first-line regularizer for any optimizer. Covered above.
Dropout¶
Drop a fraction of activations during training. Strong for feed-forward networks and transformer FFNs; weaker for CNNs.
self.drop = nn.Dropout(0.1)
Rules:
- default 0.1 in transformer FFN; 0.0 in attention unless you see overfitting (placement sketch after this list)
- dropout in LSTM/GRU only works between stacked layers, not within one layer
- dropout combined with BatchNorm rarely helps; BN already acts as a regularizer
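A typical placement in a transformer FFN block, matching the 0.1 default above (dimensions are illustrative):

import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.GELU(),
    nn.Dropout(0.1),       # after the activation
    nn.Linear(2048, 512),
    nn.Dropout(0.1),       # on the block output
)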
Label smoothing¶
Soften one-hot labels so the target keeps most of the probability mass and ε is spread over the classes (PyTorch's convention spreads ε / K uniformly over all K classes, so the target ends up with 1 - ε + ε / K). Reduces overconfidence and improves calibration:
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
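What ε = 0.1 does to a one-hot target with K = 10 classes under that convention:

eps, K = 0.1, 10
target_prob = 1 - eps + eps / K  # 0.91 on the true class
other_prob = eps / K             # 0.01 on each of the other classes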
Data augmentation¶
Often worth more than any other regularizer. Covered in Data Augmentation and Vision Augmentation and Shift Robustness.
Mixup and CutMix¶
Blend two training examples and their labels. Strong regularizer for vision; does not always help NLP.
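A minimal mixup sketch (alpha = 0.2 is an assumed default; criterion stands for a standard cross-entropy loss):

import torch

def mixup_batch(x, y, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], y, y[perm], lam

# training step: out = model(x_mixed)
# loss = lam * criterion(out, y_a) + (1 - lam) * criterion(out, y_b)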
Early stopping¶
The simplest regularizer. Watch the validation metric and stop when it plateaus or reverses. Pair with checkpointing so you keep the best epoch, not the last.
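A minimal early-stopping loop, assuming hypothetical train_one_epoch / evaluate / save_checkpoint helpers and a patience of 5 evaluations:

best_val, patience, bad_evals = float("inf"), 5, 0
for epoch in range(100):             # max epochs
    train_one_epoch()
    val = evaluate()                 # validation loss, lower is better
    if val < best_val - 1e-4:        # small tolerance against metric noise
        best_val, bad_evals = val, 0
        save_checkpoint()            # keep the best epoch, not the last
    else:
        bad_evals += 1
        if bad_evals >= patience:
            break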
Stochastic weight averaging¶
Average the weights from the last N epochs of training. Improves generalization; implemented via torch.optim.swa_utils.
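A sketch of the torch.optim.swa_utils recipe (epochs, swa_start, and swa_lr are assumed values; train_one_epoch is a placeholder):

from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

epochs, swa_start = 100, 75
swa_model = AveragedModel(model)
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
for epoch in range(epochs):
    train_one_epoch()
    if epoch >= swa_start:
        swa_model.update_parameters(model)  # fold current weights into the average
        swa_scheduler.step()
update_bn(train_loader, swa_model)  # recompute BN running stats for the averaged weights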
Reading Curves Honestly¶
- Training down, validation flat or up → capacity problem (overfit). Start with weight decay and augmentation; add dropout if still overfitting.
- Both flat and bad → underfit or optimization problem. Try a higher LR, larger model, or longer training.
- Training oscillates wildly → LR too high, missing gradient clipping, or missing warmup.
- Validation worse than training even at epoch 1 → data leak or train/val distribution mismatch, not overfitting.
- Loss spikes then recovers → unclipped gradient. Add clipping at norm 1.0.
- Loss drops then explodes far into training → mixed-precision underflow, numerical issue, or LR schedule bug.
Do not reach for dropout before you have named the problem.
Interaction With Mixed Precision¶
torch.amp changes the numerical regime and can interact with everything above:
- weight decay still works but the scaling matters; use AdamW
- gradient clipping must happen after scaler.unscale_ (see the sketch after this list)
- very small LRs can underflow in fp16; use bf16 where available
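The clipping order with a GradScaler, assuming a standard torch.amp setup where scaler, optimizer, model, and loss already exist:

scaler.scale(loss).backward()
scaler.unscale_(optimizer)   # bring gradients back to real units first
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)       # skips the step if inf/nan gradients were found
scaler.update()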
See Mixed Precision Training for the full recipe.
What To Inspect¶
- the LR trajectory (print optimizer.param_groups[0]['lr'] per step)
- gradient norm over time; a clip rate above 20% suggests the LR is too high
- the training vs validation gap, not the absolute numbers
- effective batch size after accumulation and DDP
- whether weight decay is being applied to parameters it should not be (biases, LayerNorm)
- the shape of the validation curve around the best epoch — is it a sharp minimum or a plateau?
A common refinement: exclude biases and LayerNorm parameters from weight decay.
def group_params_for_wd(model, wd):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D tensors (biases, norm scales and shifts) should not be decayed
        if p.dim() <= 1 or name.endswith(".bias") or "norm" in name.lower():
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {"params": decay, "weight_decay": wd},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# usage: optimizer = torch.optim.AdamW(group_params_for_wd(model, 0.1), lr=3e-4)
This is the default treatment in most modern transformer recipes.
Failure Pattern¶
Piling regularizers on top of an optimization problem. A training run that is unstable at LR 1e-3 does not need more dropout — it needs a lower LR, gradient clipping, or warmup. Regularization applied to an unstable optimizer produces a model that is both undertrained and underperforming.
Another failure: tuning one knob in isolation. LR and weight decay interact; LR and batch size interact; LR and warmup length interact. Ablating LR alone is rarely enough.
Common Mistakes¶
- using plain Adam when AdamW was meant
- applying weight decay to LayerNorm and bias parameters
- turning up dropout to "fix" underfitting
- using a from-scratch LR for a fine-tune
- skipping warmup on transformer training
- forgetting to scale LR when batch size changes
- stacking regularizers (dropout + weight decay + label smoothing + augmentation) without ablating
- reading training loss as the primary signal
- leaving gradient clipping off "because it wasn't needed last time"
- using the same LR across all parameter groups on a fine-tune
Practice¶
- Run an LR finder on your current task and pick an LR from the curve.
- Train the same model with and without weight decay excluded from biases and norms. Report the delta.
- Introduce one training instability (LR too high) and show how warmup, clipping, and a lower LR each fix it.
- Compare AdamW and SGD + momentum on the same task. Report validation metric and training time.
- Add gradient accumulation to double your effective batch size, and check whether the LR should scale with it.
- Train with and without label smoothing. Report validation accuracy and calibration (ECE).
- Add EMA and compare EMA vs last-step weights on validation.
- Show one curve pattern that calls for regularization and one that calls for a lower LR. Defend the diagnosis.
Runnable Example¶
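A minimal end-to-end sketch tying the pieces together: AdamW with decay-excluded groups, linear warmup, gradient clipping, dropout, and label smoothing. Everything here (the synthetic data, the small MLP, the hyperparameters) is illustrative, not a canonical recipe:

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

torch.manual_seed(0)
X = torch.randn(512, 20)                       # synthetic features
y = (X.sum(dim=1) > 0).long()                  # synthetic binary labels
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 2)
)

# decay-excluded parameter groups (see group_params_for_wd above)
decay = [p for p in model.parameters() if p.dim() > 1]
no_decay = [p for p in model.parameters() if p.dim() <= 1]
optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4,
)
scheduler = LambdaLR(optimizer, lambda s: min(1.0, (s + 1) / 100))  # 100-step warmup
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

for step in range(500):
    idx = torch.randint(0, 512, (64,))
    loss = loss_fn(model(X[idx]), y[idx])
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    if step % 100 == 0:
        print(step, loss.item(), optimizer.param_groups[0]["lr"])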
Longer Connection¶
Continue with Learning Rate Schedulers for the schedule side of LR choice, PyTorch Optimization Recipes for end-to-end examples, and Optimization, Regularization, and PEFT for the full track that combines these choices into a defended workflow.