Principal Component Analysis

What This Is

Principal Component Analysis (PCA) finds an orthonormal basis where the directions — the principal components — are ordered by how much variance of the data they capture. Projecting onto the first k components gives the k-dimensional linear subspace that retains the most variance, equivalently the linear projection with the smallest squared reconstruction error.

The practical lesson is that PCA is both a reduction (drop dimensions that carry mostly noise) and a lens (look at the structure of the data in directions you would never have picked by hand). Every wide feature matrix is worth plotting in its top-2 PCA projection at least once before fitting a model.

When You Use It

  • pre-processing for distance-based methods (KNN, K-Means) on wide feature matrices
  • visualizing high-dimensional data in 2-3 dimensions
  • denoising — drop the low-variance tail and reconstruct
  • whitening — decorrelate features so a downstream linear model sees an identity covariance (see the sketch after this list)
  • pipeline speed — reduce to 50 components before a slow model
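
A minimal sketch of the whitening case, using a random correlated matrix purely as a stand-in for real data: PCA with whiten=True rescales each component to unit variance, so the transformed features come out decorrelated with an (approximately) identity covariance.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8)) @ rng.normal(size=(8, 8))  # correlated features

# whiten=True rescales each component to unit variance after projection
X_white = PCA(whiten=True).fit_transform(X)

# The empirical covariance of the whitened data is (close to) the identity
print(np.cov(X_white, rowvar=False).round(2))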

Do Not Use It When

  • the signal lives in small-variance directions (rare but real — PCA can actively hurt)
  • you need a nonlinear manifold view — use UMAP, t-SNE, or an autoencoder; see Advanced Clustering and Dimensionality Reduction
  • the target is categorical and the classes are separated along a low-variance direction — LDA (linear discriminant analysis) is the supervised alternative
  • features are unscaled and on different units — PCA on raw dollars plus raw percents will be dominated by dollars

The Derivation

Let X be a centered n × d data matrix (subtract the mean per feature). The sample covariance matrix is:

S = (1 / (n - 1)) X^T X

The principal components are the eigenvectors of S, ordered by decreasing eigenvalue. The i-th eigenvalue is the variance along the i-th component.

Equivalently, compute the singular value decomposition X = U Σ V^T. The columns of V are the principal components; the squared singular values divided by n - 1 are the eigenvalues of S. The SVD route is what scikit-learn uses because it is numerically stable and does not require forming S explicitly.
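
A small numpy sketch of that equivalence, with a random matrix standing in for real data: the eigenvalues of S match the squared singular values divided by n - 1, and the eigenvectors match the columns of V up to sign.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                                # center per feature
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance
S = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]     # sort by decreasing eigenvalue

# Route 2: SVD of the centered data matrix
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(eigvals, sigma**2 / (n - 1)))        # same spectrum
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))      # same directions up to sign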

The Algorithm

1. center X — subtract the column mean
2. (optional) scale X — divide by column standard deviation
3. compute SVD: X = U Σ V^T
4. keep the first k columns of V; call them V_k
5. project: X_reduced = X V_k

To reconstruct: X_hat = X_reduced V_k^T + mean. The reconstruction is lossy by exactly the variance in the dropped components, i.e. the sum of their eigenvalues.
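
A sketch of those five steps plus the reconstruction in plain numpy, assuming a dense matrix that fits in memory; the function names are illustrative, not a library API.

import numpy as np

def pca_reduce(X, k):
    # 1. center (scaling, step 2, is left to the caller)
    mean = X.mean(axis=0)
    Xc = X - mean
    # 3. SVD of the centered matrix
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
    # 4. keep the first k components (d x k)
    V_k = Vt[:k].T
    # 5. project
    return Xc @ V_k, V_k, mean

def pca_reconstruct(X_reduced, V_k, mean):
    # X_hat = X_reduced V_k^T + mean
    return X_reduced @ V_k.T + mean

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X_reduced, V_k, mean = pca_reduce(X, k=3)
X_hat = pca_reconstruct(X_reduced, V_k, mean)
print(((X - X_hat) ** 2).mean())   # the error left in the 7 dropped components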

Tooling

  • PCA
  • IncrementalPCA for data that does not fit in memory (see the sketch after this list)
  • KernelPCA for nonlinear variants
  • TruncatedSVD — PCA without centering, for sparse matrices
  • n_components — integer or a float like 0.95 to keep 95% of variance
  • explained_variance_ratio_
  • components_ — the eigenvectors
  • StandardScaler in a pipeline
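
For the IncrementalPCA item, a sketch of the batch-by-batch fit, assuming the data arrives in chunks; here the chunks are just slices of an in-memory array standing in for reads from disk.

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))            # stand-in for data streamed from disk

ipca = IncrementalPCA(n_components=10)
for batch in np.array_split(X, 20):           # each chunk must hold >= n_components rows
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X)
print(X_reduced.shape)                        # (10000, 10)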

Choosing the Number of Components

Three honest ways:

  • cumulative explained variance — pick k such that the sum of explained_variance_ratio_[:k] is at least 0.90 or 0.95 (sketched after this list)
  • scree plot — plot explained_variance_ratio_ vs. component index; take the elbow
  • downstream metric — sweep k and pick the value that maximizes validation accuracy of the downstream model

All three disagree sometimes. The downstream metric wins when you have one.
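
A sketch of the cumulative-explained-variance rule, with a random correlated matrix standing in for real data; pick_k is an illustrative helper, not a library function.

import numpy as np
from sklearn.decomposition import PCA

def pick_k(X, threshold=0.95):
    # fit a full PCA, then take the smallest k whose cumulative ratio reaches the threshold
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40)) @ rng.normal(size=(40, 40))   # correlated features
print(pick_k(X, 0.90), pick_k(X, 0.95))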

Scaling Decision

PCA is variance-maximizing, so features with larger numerical variance dominate. Two regimes:

  • features share a unit (pixel intensities, normalized word counts) — centering is enough; scaling can erase real signal
  • features have different units (dollars, percent, count) — scale before PCA, always

When in doubt, scale. You will notice immediately if PCA stops making sense.
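
A sketch of the two regimes on two made-up columns, one in dollars and one as a fraction; the numbers are invented purely to show the effect.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dollars = rng.normal(50_000, 15_000, size=1_000)                   # huge numeric variance
percent = 0.3 + dollars * 2e-6 + rng.normal(0, 0.05, size=1_000)   # tiny numeric variance
X = np.column_stack([dollars, percent])

raw = PCA(n_components=1).fit(X)
scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X))

print(raw.components_[0])      # ~[1, 0]: PC1 is just the dollar column
print(scaled.components_[0])   # comparable loadings once the units are removed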

Minimal Example

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pca.fit_transform(X_train)

Using n_components=0.95 tells PCA to keep as many components as needed to explain 95% of the variance. That is almost always a better first move than guessing the integer.
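
Assuming X_test is the matching held-out split, the fitted pipeline is reused as-is: the mean, scale, and components learned on X_train are applied to the new rows, and the fitted PCA step reports what 0.95 turned into.

# Apply the training-time mean, scale, and components to held-out rows
X_test_reduced = pca.transform(X_test)

# What n_components=0.95 resolved to, and the variance those components explain
fitted_pca = pca.named_steps["pca"]
print(fitted_pca.n_components_)
print(fitted_pca.explained_variance_ratio_.sum())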

Interpreting Components

pca.components_[i] is the i-th eigenvector — a vector in the original feature space. The largest-magnitude entries tell you which original features move together under that component. Writing down the top 3-5 loadings for the top 2-3 components is the first interpretability pass.
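
A sketch of that first pass, using the scikit-learn wine dataset purely as a stand-in for a real feature matrix:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_wine()
X = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=3).fit(X)

# For each leading component, print the five largest-magnitude loadings
for i, component in enumerate(pca.components_):
    order = np.argsort(np.abs(component))[::-1][:5]
    pairs = ", ".join(f"{data.feature_names[j]}: {component[j]:+.2f}" for j in order)
    print(f"PC{i + 1}: {pairs}")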

Two warnings:

  • component signs are arbitrary — flipping all the signs of a component is an equally valid answer
  • components are not the same as latent factors in the scientific sense — interpretability is about reading the loadings, not declaring that PC1 is the concept

What To Inspect

  • the scree curve — does it have a real elbow
  • the 2D PC1-vs-PC2 scatter, colored by any label you have — does structure appear
  • the reconstruction error for a few held-out rows — is the lossy reconstruction acceptable (see the sketch after this list)
  • the top loadings of the top components — do they tell a coherent story
  • whether the downstream model benefits — a reduction that hurts the downstream task is a failed reduction
  • whether scaling was applied
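
For the reconstruction-error check in the list above, a sketch using inverse_transform on rows the PCA never saw; the synthetic matrix and the choice of 10 components are only placeholders.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 50)) @ rng.normal(size=(50, 50))   # correlated features
X_train, X_held_out = train_test_split(X, test_size=0.2, random_state=0)

pca = PCA(n_components=10).fit(X_train)
X_hat = pca.inverse_transform(pca.transform(X_held_out))

# Mean squared reconstruction error on rows PCA never saw
print(((X_held_out - X_hat) ** 2).mean())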

Failure Pattern

The classic version is applying PCA to unscaled features that include a high-variance nuisance column (like a row ID used as a feature). PC1 becomes the nuisance column and the whole projection is useless.
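
A sketch of that failure on synthetic data, with a row ID playing the nuisance column:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signal = rng.normal(size=(1_000, 5))           # the features that actually matter
row_id = np.arange(1_000).reshape(-1, 1)       # huge-variance nuisance column
X = np.hstack([signal, row_id])

pca = PCA(n_components=2).fit(X)
print(pca.components_[0])                      # ~[0, 0, 0, 0, 0, 1]: PC1 is the row ID
print(pca.explained_variance_ratio_[0])        # close to 1.0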

A second failure pattern is assuming PCA preserves class separability. It maximizes variance, not discriminability. When the class boundary happens to be along a low-variance direction, PCA can destroy it. LDA is the supervised alternative.

A third failure pattern is reading components as "meaning." They are linear combinations of original features; the sign is arbitrary, and two components can together represent what one "natural" concept would.

Quick Checks

  1. Are the features centered (and usually scaled)?
  2. Is the number of components chosen from data, not guessed?
  3. Does PC1 vs. PC2 colored by label show any structure?
  4. Do the top loadings for PC1 tell a coherent story?
  5. Does the downstream model improve on PCA-reduced features, or just run faster?

Practice

  1. Apply PCA to a 2D correlated Gaussian and plot the principal directions.
  2. Reduce MNIST to 50 components and reconstruct. Show the reconstructed digits.
  3. Plot cumulative explained variance and pick k at 90% and at 95%. Compare downstream accuracy.
  4. Run PCA on unscaled features and on scaled features. Compare the top loadings.
  5. Compare PCA and LDA on a classification problem. Explain where each wins.
  6. Explain why the signs of PCA components are arbitrary.
  7. Describe one case where PCA destroys the signal a classifier needs.
  8. Explain the relationship between PCA and SVD.
  9. State what IncrementalPCA buys you over regular PCA.
  10. Describe one situation where a nonlinear method (UMAP, autoencoder) is the right tool instead of PCA.

Longer Connection

PCA sits next to LDA, the supervised projection mentioned above, and to the nonlinear methods covered in Advanced Clustering and Dimensionality Reduction (UMAP, t-SNE, autoencoders).

PCA is a lens first and a reduction second. Its job is to make the geometry of the data visible. Once you have seen it, you can decide whether reducing is even a good idea.