Principal Component Analysis¶
What This Is¶
Principal Component Analysis (PCA) finds an orthonormal basis where the directions — the principal components — are ordered by how much variance of the data they capture. Projecting onto the first k components gives the k-dimensional subspace that preserves more variance than any other k-dimensional linear projection.
The practical lesson is that PCA is both a reduction (drop dimensions that carry mostly noise) and a lens (look at the structure of the data in directions you would never have picked by hand). Every wide feature matrix is worth plotting in its top-2 PCA projection at least once before fitting a model.
When You Use It¶
- pre-processing for distance-based methods (KNN, K-Means) on wide feature matrices
- visualizing high-dimensional data in 2-3 dimensions
- denoising — drop the low-variance tail and reconstruct
- whitening — decorrelate features so a downstream linear model sees an identity covariance
- pipeline speed — reduce to 50 components before a slow model
Do Not Use It When¶
- the signal lives in small-variance directions (rare but real — PCA can actively hurt)
- you need a nonlinear manifold view — use UMAP, t-SNE, or an autoencoder; see Advanced Clustering and Dimensionality Reduction
- the target is categorical and the classes are separated along a low-variance direction — LDA (linear discriminant analysis) is the supervised alternative
- features are unscaled and on different units — PCA on raw dollars plus raw percents will be dominated by dollars
The Derivation¶
Let X be a centered n × d data matrix (subtract the mean per feature). The sample covariance matrix is:
S = (1 / (n - 1)) X^T X
The principal components are the eigenvectors of S, ordered by decreasing eigenvalue. The i-th eigenvalue is the variance along the i-th component.
Equivalently, compute the singular value decomposition X = U Σ V^T. The columns of V are the principal components; the squared singular values divided by n - 1 are the eigenvalues of S. The SVD route is what scikit-learn uses because it is numerically stable and does not require forming S explicitly.
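A quick way to check that the two routes agree, on synthetic data — a NumPy sketch, not part of any library API:
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features
Xc = X - X.mean(axis=0)                                    # center per feature
# Route 1: eigendecomposition of the sample covariance matrix
S = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(S)                       # returned in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]         # flip to descending
# Route 2: SVD of the centered data matrix
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(sigma**2 / (len(Xc) - 1), eigvals))      # same variances
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))          # same directions, up to sign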
The Algorithm¶
1. center X — subtract the column mean
2. (optional) scale X — divide by column standard deviation
3. compute SVD: X = U Σ V^T
4. keep the first k columns of V; call them V_k
5. project: X_reduced = X V_k
To reconstruct: X_hat = X_reduced V_k^T + mean. The reconstruction is lossy by exactly the variance in the dropped components.
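A minimal NumPy sketch of steps 1-5 plus the reconstruction (scaling omitted; pca_reduce and pca_reconstruct are illustrative names, not a library API):
import numpy as np
def pca_reduce(X, k):
    """Center, take the SVD, keep the first k right singular vectors, project."""
    mean = X.mean(axis=0)
    Xc = X - mean                                           # step 1: center
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)   # step 3: SVD
    V_k = Vt[:k].T                                          # step 4: first k components
    return Xc @ V_k, V_k, mean                              # step 5: project
def pca_reconstruct(X_reduced, V_k, mean):
    """Lossy inverse: map back to the original feature space and re-add the mean."""
    return X_reduced @ V_k.T + mean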
Tooling¶
- PCA — the standard implementation (sklearn.decomposition)
- IncrementalPCA — for data that does not fit in memory
- KernelPCA — for nonlinear variants
- TruncatedSVD — PCA without centering, for sparse matrices
- n_components — integer, or a float like 0.95 to keep 95% of variance
- explained_variance_ratio_ — the fraction of variance each component explains
- components_ — the eigenvectors
- StandardScaler — in a pipeline, when features need scaling before PCA
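A short sketch of the out-of-core and sparse variants, on made-up data:
import numpy as np
from scipy import sparse
from sklearn.decomposition import IncrementalPCA, TruncatedSVD
rng = np.random.default_rng(0)
# Out-of-core: feed the data in batches instead of all at once.
ipca = IncrementalPCA(n_components=50)
for batch in np.array_split(rng.normal(size=(5000, 200)), 20):
    ipca.partial_fit(batch)
# Sparse input (e.g. TF-IDF): TruncatedSVD skips centering, so sparsity is preserved.
X_sparse = sparse.random(1000, 500, density=0.01, random_state=0, format="csr")
svd = TruncatedSVD(n_components=100)
X_svd = svd.fit_transform(X_sparse)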
Choosing the Number of Components¶
Three honest ways:
- cumulative explained variance — pick k such that the sum of explained_variance_ratio_[:k] is at least 0.90 or 0.95
- scree plot — plot explained_variance_ratio_ vs. component index; take the elbow
- downstream metric — sweep k and pick the value that maximizes validation accuracy of the downstream model
All three disagree sometimes. The downstream metric wins when you have one.
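The first two criteria take a few lines to read off once PCA is fit with all components — a sketch, assuming a feature matrix X_train that is already centered and scaled:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA().fit(X_train)                       # X_train: assumed, already scaled
cumvar = np.cumsum(pca.explained_variance_ratio_)
k_90 = np.searchsorted(cumvar, 0.90) + 1       # smallest k reaching 90% of variance
k_95 = np.searchsorted(cumvar, 0.95) + 1
print(k_90, k_95)
# Scree plot: look for the elbow by eye.
plt.plot(pca.explained_variance_ratio_, marker="o")
plt.xlabel("component index")
plt.ylabel("explained variance ratio")
plt.show()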
Scaling Decision¶
PCA is variance-maximizing, so features with larger numerical variance dominate. Two regimes:
- features share a unit (pixel intensities, normalized word counts) — centering is enough; scaling can erase real signal
- features have different units (dollars, percent, count) — scale before PCA, always
When in doubt, scale. You will notice immediately if PCA stops making sense.
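A toy demonstration of the dominance effect, with made-up features on wildly different scales:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50_000, 20_000, size=500),   # salary in dollars: huge variance
    rng.normal(0.5, 0.1, size=500),         # a rate on a 0-1 scale: tiny variance
])
pca = PCA(n_components=2).fit(X)
print(pca.components_[0])   # PC1 is essentially the dollar axis, roughly [1, 0]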
Minimal Example¶
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pca.fit_transform(X_train)
Using n_components=0.95 tells PCA to keep as many components as needed to explain 95% of the variance. That is almost always a better first move than guessing the integer.
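Since pca here is a pipeline, the fitted PCA step sits behind the step name that make_pipeline assigns (the lowercased class name); a quick inspection sketch:
fitted_pca = pca.named_steps["pca"]              # the PCA step inside the pipeline
print(fitted_pca.n_components_)                  # how many components 95% actually required
print(fitted_pca.explained_variance_ratio_.sum())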
Interpreting Components¶
pca.components_[i] is the i-th eigenvector — a vector in the original feature space. The largest-magnitude entries tell you which original features move together under that component. Writing down the top 3-5 loadings for the top 2-3 components is the first interpretability pass.
Two warnings:
- component signs are arbitrary — flipping all the signs of a component is an equally valid answer
- components are not the same as latent factors in the scientific sense — interpretability is about reading the loadings, not declaring that PC1 is the concept
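A small helper for that first pass — top_loadings is an illustrative name; it assumes pca is a fitted PCA object (pull it out of a pipeline with named_steps["pca"] if needed) and feature_names lists the original columns:
import numpy as np
def top_loadings(pca, feature_names, n_components=2, n_top=5):
    # For each leading component, print the features with the
    # largest-magnitude loadings and their signed values.
    for i in range(n_components):
        loadings = pca.components_[i]
        top = np.argsort(np.abs(loadings))[::-1][:n_top]
        print(f"PC{i + 1}:", [(feature_names[j], round(loadings[j], 3)) for j in top])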
What To Inspect¶
- the scree curve — does it have a real elbow
- the 2D PC1-vs-PC2 scatter, colored by any label you have — does structure appear
- the reconstruction error for a few held-out rows — is the lossy reconstruction acceptable (see the sketch after this list)
- the top loadings of the top components — do they tell a coherent story
- whether the downstream model benefits — a reduction that hurts the downstream task is a failed reduction
- whether scaling was applied
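A sketch of the reconstruction check, assuming pca is the fitted scaler-plus-PCA pipeline from the minimal example and X_holdout holds a few unseen rows:
import numpy as np
X_back = pca.inverse_transform(pca.transform(X_holdout))   # round-trip through k components
rmse = np.sqrt(((X_holdout - X_back) ** 2).mean(axis=1))   # per-row reconstruction error
print(rmse)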
Failure Pattern¶
The classic failure is applying PCA to unscaled features that include a high-variance nuisance column (like a row ID used as a feature): PC1 becomes the nuisance column and the whole projection is useless.
A second failure pattern is assuming PCA preserves class separability. It maximizes variance, not discriminability. When the class boundary happens to be along a low-variance direction, PCA can destroy it. LDA is the supervised alternative.
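A toy construction of exactly that situation — made-up data where the class signal lives along the low-variance axis:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
rng = np.random.default_rng(0)
n = 500
nuisance = rng.normal(0, 10.0, size=(2 * n, 1))                  # high variance, no class signal
signal = np.concatenate([rng.normal(-1, 0.3, size=(n, 1)),
                         rng.normal(+1, 0.3, size=(n, 1))])      # low variance, all the signal
X = np.hstack([nuisance, signal])
y = np.array([0] * n + [1] * n)
X_pca = PCA(n_components=1).fit_transform(X)                     # keeps the nuisance axis
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
def separation(z, y):
    # difference of class means, in units of the overall standard deviation
    return abs(z[y == 0].mean() - z[y == 1].mean()) / z.std()
print(separation(X_pca, y))   # near zero: PCA kept the nuisance direction
print(separation(X_lda, y))   # large: LDA kept the discriminative direction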
A third failure pattern is reading components as "meaning." They are linear combinations of original features; the sign is arbitrary, and two components can together represent what one "natural" concept would.
Quick Checks¶
- Are the features centered (and usually scaled)?
- Is the number of components chosen from data, not guessed?
- Does PC1 vs. PC2 colored by label show any structure?
- Do the top loadings for PC1 tell a coherent story?
- Does the downstream model improve on PCA-reduced features, or just run faster?
Practice¶
- Apply PCA to a 2D correlated Gaussian and plot the principal directions.
- Reduce MNIST to 50 components and reconstruct. Show the reconstructed digits.
- Plot cumulative explained variance and pick k at 90% and at 95%. Compare downstream accuracy.
- Run PCA on unscaled features and on scaled features. Compare the top loadings.
- Compare PCA and LDA on a classification problem. Explain where each wins.
- Explain why the signs of PCA components are arbitrary.
- Describe one case where PCA destroys the signal a classifier needs.
- Explain the relationship between PCA and SVD.
- State what IncrementalPCA buys you over regular PCA.
- Describe one situation where a nonlinear method (UMAP, autoencoder) is the right tool instead of PCA.
Longer Connection¶
PCA sits next to:
- Dimensionality Reduction — the surrounding workflow, including when to choose PCA vs. alternatives
- Advanced Clustering and Dimensionality Reduction — UMAP, t-SNE, and friends for nonlinear cases
- K-Nearest Neighbors — distance-based models that benefit most from PCA
- K-Means — the clustering method that most often pairs with PCA
- Feature Selection — the discrete cousin of dimensionality reduction
PCA is a lens first and a reduction second. Its job is to make the geometry of the data visible. Once you have seen it, you can decide whether reducing is even a good idea.