Principal Component Analysis¶
What This Is¶
Principal Component Analysis (PCA) finds an orthonormal basis where the directions — the principal components — are ordered by how much variance of the data they capture. Projecting onto the first k components gives the k-dimensional subspace that preserves more variance than any other k-dimensional linear projection.
The practical lesson is that PCA is both a reduction (drop dimensions that carry mostly noise) and a lens (look at the structure of the data in directions you would never have picked by hand). Every wide feature matrix is worth plotting in its top-2 PCA projection at least once before fitting a model.
When You Use It¶
- pre-processing for distance-based methods (KNN, K-Means) on wide feature matrices
- visualizing high-dimensional data in 2-3 dimensions
- denoising — drop the low-variance tail and reconstruct
- whitening — decorrelate features so a downstream linear model sees an identity covariance
- pipeline speed — reduce to 50 components before a slow model
Do Not Use It When¶
- the signal lives in small-variance directions (rare but real — PCA can actively hurt)
- you need a nonlinear manifold view — use UMAP, t-SNE, or an autoencoder; see Advanced Clustering and Dimensionality Reduction
- the target is categorical and the classes are separated along a low-variance direction — LDA (linear discriminant analysis) is the supervised alternative
- features are unscaled and on different units — PCA on raw dollars plus raw percents will be dominated by dollars
The Derivation¶
Let X be a centered n × d data matrix (subtract the mean per feature). The sample covariance matrix is:
S = (1 / (n - 1)) X^T X
The principal components are the eigenvectors of S, ordered by decreasing eigenvalue. The i-th eigenvalue is the variance along the i-th component.
Equivalently, compute the singular value decomposition X = U Σ V^T. The columns of V are the principal components; the squared singular values divided by n - 1 are the eigenvalues of S. The SVD route is what scikit-learn uses because it is numerically stable and does not require forming S explicitly.
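A quick way to check that the two routes agree, on synthetic data — a NumPy sketch, not part of any library API:
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features
Xc = X - X.mean(axis=0)                                    # center per feature
# Route 1: eigendecomposition of the sample covariance matrix
S = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(S)                       # returned in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]         # flip to descending
# Route 2: SVD of the centered data matrix
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(sigma**2 / (len(Xc) - 1), eigvals))      # same variances
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))          # same directions, up to sign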
The Algorithm¶
1. center X — subtract the column mean
2. (optional) scale X — divide by column standard deviation
3. compute SVD: X = U Σ V^T
4. keep the first k columns of V; call them V_k
5. project: X_reduced = X V_k
To reconstruct: X_hat = X_reduced V_k^T + mean. The reconstruction is lossy by exactly the variance in the dropped components.
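A minimal NumPy sketch of steps 1-5 plus the reconstruction (scaling omitted; pca_reduce and pca_reconstruct are illustrative names, not a library API):
import numpy as np
def pca_reduce(X, k):
    """Center, take the SVD, keep the first k right singular vectors, project."""
    mean = X.mean(axis=0)
    Xc = X - mean                                           # step 1: center
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)   # step 3: SVD
    V_k = Vt[:k].T                                          # step 4: first k components
    return Xc @ V_k, V_k, mean                              # step 5: project
def pca_reconstruct(X_reduced, V_k, mean):
    """Lossy inverse: map back to the original feature space and re-add the mean."""
    return X_reduced @ V_k.T + mean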
Tooling¶
- PCA — the standard implementation (sklearn.decomposition)
- IncrementalPCA — for data that does not fit in memory
- KernelPCA — for nonlinear variants
- TruncatedSVD — PCA without centering, for sparse matrices
- n_components — integer, or a float like 0.95 to keep 95% of variance
- explained_variance_ratio_ — the fraction of variance each component explains
- components_ — the eigenvectors
- StandardScaler — in a pipeline, when features need scaling before PCA
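A short sketch of the out-of-core and sparse variants, on made-up data:
import numpy as np
from scipy import sparse
from sklearn.decomposition import IncrementalPCA, TruncatedSVD
rng = np.random.default_rng(0)
# Out-of-core: feed the data in batches instead of all at once.
ipca = IncrementalPCA(n_components=50)
for batch in np.array_split(rng.normal(size=(5000, 200)), 20):
    ipca.partial_fit(batch)
# Sparse input (e.g. TF-IDF): TruncatedSVD skips centering, so sparsity is preserved.
X_sparse = sparse.random(1000, 500, density=0.01, random_state=0, format="csr")
svd = TruncatedSVD(n_components=100)
X_svd = svd.fit_transform(X_sparse)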
Choosing the Number of Components¶
Three honest ways:
- cumulative explained variance — pick k such that the sum of explained_variance_ratio_[:k] is at least 0.90 or 0.95
- scree plot — plot explained_variance_ratio_ vs. component index; take the elbow
- downstream metric — sweep k and pick the value that maximizes validation accuracy of the downstream model
All three disagree sometimes. The downstream metric wins when you have one.
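The first two criteria take a few lines to read off once PCA is fit with all components — a sketch, assuming a feature matrix X_train that is already centered and scaled:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA().fit(X_train)                       # X_train: assumed, already scaled
cumvar = np.cumsum(pca.explained_variance_ratio_)
k_90 = np.searchsorted(cumvar, 0.90) + 1       # smallest k reaching 90% of variance
k_95 = np.searchsorted(cumvar, 0.95) + 1
print(k_90, k_95)
# Scree plot: look for the elbow by eye.
plt.plot(pca.explained_variance_ratio_, marker="o")
plt.xlabel("component index")
plt.ylabel("explained variance ratio")
plt.show()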
Scaling Decision¶
PCA is variance-maximizing, so features with larger numerical variance dominate. Two regimes:
- features share a unit (pixel intensities, normalized word counts) — centering is enough; scaling can erase real signal
- features have different units (dollars, percent, count) — scale before PCA, always
When in doubt, scale. You will notice immediately if PCA stops making sense.
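A toy demonstration of the dominance effect, with made-up features on wildly different scales:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50_000, 20_000, size=500),   # salary in dollars: huge variance
    rng.normal(0.5, 0.1, size=500),         # a rate on a 0-1 scale: tiny variance
])
pca = PCA(n_components=2).fit(X)
print(pca.components_[0])   # PC1 is essentially the dollar axis, roughly [1, 0]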
Minimal Example¶
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pca.fit_transform(X_train)
Using n_components=0.95 tells PCA to keep as many components as needed to explain 95% of the variance. That is almost always a better first move than guessing the integer.
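Since pca here is a pipeline, the fitted PCA step sits behind the step name that make_pipeline assigns (the lowercased class name); a quick inspection sketch:
fitted_pca = pca.named_steps["pca"]              # the PCA step inside the pipeline
print(fitted_pca.n_components_)                  # how many components 95% actually required
print(fitted_pca.explained_variance_ratio_.sum())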
Interpreting Components¶
pca.components_[i] is the i-th eigenvector — a vector in the original feature space. The largest-magnitude entries tell you which original features move together under that component. Writing down the top 3-5 loadings for the top 2-3 components is the first interpretability pass.
Two warnings:
- component signs are arbitrary — flipping all the signs of a component is an equally valid answer
- components are not the same as latent factors in the scientific sense — interpretability is about reading the loadings, not declaring that PC1 is the concept
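A small helper for that first pass — top_loadings is an illustrative name; it assumes pca is a fitted PCA object (pull it out of a pipeline with named_steps["pca"] if needed) and feature_names lists the original columns:
import numpy as np
def top_loadings(pca, feature_names, n_components=2, n_top=5):
    # For each leading component, print the features with the
    # largest-magnitude loadings and their signed values.
    for i in range(n_components):
        loadings = pca.components_[i]
        top = np.argsort(np.abs(loadings))[::-1][:n_top]
        print(f"PC{i + 1}:", [(feature_names[j], round(loadings[j], 3)) for j in top])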
What To Inspect¶
- the scree curve — does it have a real elbow
- the 2D PC1-vs-PC2 scatter, colored by any label you have — does structure appear
- the reconstruction error for a few held-out rows — is the lossy reconstruction acceptable (see the sketch after this list)
- the top loadings of the top components — do they tell a coherent story
- whether the downstream model benefits — a reduction that hurts the downstream task is a failed reduction
- whether scaling was applied
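A sketch of the reconstruction check, assuming pca is the fitted scaler-plus-PCA pipeline from the minimal example and X_holdout holds a few unseen rows:
import numpy as np
X_back = pca.inverse_transform(pca.transform(X_holdout))   # round-trip through k components
rmse = np.sqrt(((X_holdout - X_back) ** 2).mean(axis=1))   # per-row reconstruction error
print(rmse)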
Failure Pattern¶
The classic failure is applying PCA to unscaled features that include a high-variance nuisance column (like a row ID used as a feature): PC1 becomes the nuisance column and the whole projection is useless.
A second failure pattern is assuming PCA preserves class separability. It maximizes variance, not discriminability. When the class boundary happens to be along a low-variance direction, PCA can destroy it. LDA is the supervised alternative.
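A toy construction of exactly that situation — made-up data where the class signal lives along the low-variance axis:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
rng = np.random.default_rng(0)
n = 500
nuisance = rng.normal(0, 10.0, size=(2 * n, 1))                  # high variance, no class signal
signal = np.concatenate([rng.normal(-1, 0.3, size=(n, 1)),
                         rng.normal(+1, 0.3, size=(n, 1))])      # low variance, all the signal
X = np.hstack([nuisance, signal])
y = np.array([0] * n + [1] * n)
X_pca = PCA(n_components=1).fit_transform(X)                     # keeps the nuisance axis
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
def separation(z, y):
    # difference of class means, in units of the overall standard deviation
    return abs(z[y == 0].mean() - z[y == 1].mean()) / z.std()
print(separation(X_pca, y))   # near zero: PCA kept the nuisance direction
print(separation(X_lda, y))   # large: LDA kept the discriminative direction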
A third failure pattern is reading components as "meaning." They are linear combinations of original features; the sign is arbitrary, and two components can together represent what one "natural" concept would.
Quick Checks¶
- Are the features centered (and usually scaled)?
- Is the number of components chosen from data, not guessed?
- Does PC1 vs. PC2 colored by label show any structure?
- Do the top loadings for PC1 tell a coherent story?
- Does the downstream model improve on PCA-reduced features, or just run faster?
Practice¶
- Apply PCA to a 2D correlated Gaussian and plot the principal directions.
- Reduce MNIST to 50 components and reconstruct. Show the reconstructed digits.
- Plot cumulative explained variance and pick k at 90% and at 95%. Compare downstream accuracy.
- Run PCA on unscaled features and on scaled features. Compare the top loadings.
- Compare PCA and LDA on a classification problem. Explain where each wins.
- Explain why the signs of PCA components are arbitrary.
- Describe one case where PCA destroys the signal a classifier needs.
- Explain the relationship between PCA and SVD.
- State what IncrementalPCA buys you over regular PCA.
- Describe one situation where a nonlinear method (UMAP, autoencoder) is the right tool instead of PCA.
Longer Connection¶
PCA sits next to:
- Dimensionality Reduction — the surrounding workflow, including when to choose PCA vs. alternatives
- Advanced Clustering and Dimensionality Reduction — UMAP, t-SNE, and friends for nonlinear cases
- K-Nearest Neighbors — distance-based models that benefit most from PCA
- K-Means — the clustering method that most often pairs with PCA
- Feature Selection — the discrete cousin of dimensionality reduction
PCA is a lens first and a reduction second. Its job is to make the geometry of the data visible. Once you have seen it, you can decide whether reducing is even a good idea.