Principal Component Analysis (PCA)¶
The Story Behind the Mathematics¶
The story of Principal Component Analysis begins in 1901, in the corridors of University College London, where Karl Pearson (1857-1936), the father of modern statistics, was grappling with a fundamental problem in geometry and data analysis.
Pearson was studying biological measurements — skull dimensions, body proportions, evolutionary traits. He noticed that many measurements were correlated: larger organisms tended to be larger in multiple dimensions simultaneously. This redundancy meant that high-dimensional data could potentially be described more simply.
Pearson's key question: Given points in high-dimensional space, how do we find the "best-fitting" lower-dimensional subspace that captures most of the variation?
In his 1901 paper "On Lines and Planes of Closest Fit to Systems of Points in Space," Pearson introduced what he called the method of principal axes. He showed that the optimal line through a cloud of points was the one that minimized the sum of squared perpendicular distances from points to the line — equivalently, the line that maximized the variance of the projected points.
Pearson's geometric insight: The first principal axis points in the direction of maximum variance. The second principal axis, orthogonal to the first, points in the direction of maximum remaining variance, and so on.
Remarkably, Harold Hotelling (1895-1973), an American statistician, independently rediscovered and extended this method in 1933. Hotelling developed the algebraic framework we use today, connecting the geometric intuition to eigenvalue decomposition of covariance matrices. He coined the term "Principal Components" and showed how to extract them systematically.
Hotelling's contributions:
- Formalized PCA using covariance matrices and eigenanalysis
- Developed statistical tests for significance of components
- Connected PCA to factor analysis and canonical correlation
Historical parallel: Interestingly, PCA was developed around the same time that Einstein was revolutionizing physics with relativity. Both involved finding the "right" coordinate system: Einstein sought coordinates that made physical laws simple; Pearson and Hotelling sought coordinates that made data variation simple.
The computational revolution: For decades, PCA remained computationally expensive, limiting its use to small datasets. The advent of computers in the 1960s-70s, combined with efficient algorithms like the QR algorithm and Singular Value Decomposition (SVD), made PCA practical for large-scale applications.
Modern significance: PCA is now one of the most widely used techniques in data science. Its adoption has broadened decade by decade:
- 1960s-70s: Psychometrics and factor analysis (IQ tests, personality assessments)
- 1980s-90s: Image compression (JPEG precursors), face recognition (eigenfaces)
- 2000s: Genomics (gene expression analysis), finance (portfolio optimization)
- 2010s-present: Machine learning preprocessing, deep learning initialization, data visualization
Though over a century old, PCA remains essential: it's taught in every data science curriculum and used in virtually every field that deals with multivariate data.
Why It Matters¶
Principal Component Analysis is the workhorse of dimensionality reduction and exploratory data analysis. It's used in:
- Machine Learning: Feature extraction, data preprocessing, noise reduction
- Computer Vision: Face recognition (eigenfaces), image compression, object detection
- Genomics: Gene expression analysis, population genetics, variant discovery
- Finance: Portfolio optimization, risk management, factor models
- Neuroscience: Brain imaging analysis (fMRI, EEG), neural population analysis
- Chemistry: Spectroscopy analysis, quantitative structure-activity relationships (QSAR)
- Climate Science: Pattern extraction from spatiotemporal data
- Marketing: Customer segmentation, market research
- Quality Control: Process monitoring, fault detection
- Natural Language Processing: Latent semantic analysis, document clustering
- Psychology: Factor analysis, psychometric testing
- Physics: Data compression in high-energy physics experiments
PCA provides both interpretable low-dimensional representations and efficient compression, making it invaluable for understanding and working with high-dimensional data.
Prerequisites¶
- Linear algebra (Matrix Operations, Eigenvalues and Eigenvectors)
- Variance and Covariance
- Expected Value and basic statistics
- Matrix decompositions (especially eigenvalue decomposition)
- Understanding of orthogonality and projections
Fundamental Concepts¶
We'll build PCA from first principles, starting with the geometric intuition and progressing to the full mathematical framework.
The Dimensionality Reduction Problem¶
We have:
- Data matrix: \(\mathbf{X} \in \mathbb{R}^{n \times p}\)
- \(n\) observations (rows)
- \(p\) features/variables (columns)
- \(\mathbf{x}_i \in \mathbb{R}^p\): the \(i\)-th observation (row vector)
Goal: Find a lower-dimensional representation \(\mathbf{Z} \in \mathbb{R}^{n \times d}\) where \(d < p\), such that:
- Maximum variance is preserved: capture as much variation as possible
- Minimum information loss: enable approximate reconstruction of original data
- Orthogonal components: new dimensions are uncorrelated
Key insight: We want to find a new coordinate system (basis) where the data has special properties: maximum variance along the first axis, maximum remaining variance along the second axis (perpendicular to the first), etc.
Centering the Data¶
First step: Center the data by subtracting the mean.
Column means:

\[\bar{\mathbf{x}} = \frac{1}{n} \mathbf{X}^T \mathbf{1}_n, \qquad \bar{x}_j = \frac{1}{n} \sum_{i=1}^n X_{ij}\]

Centered data matrix:

\[\mathbf{X}_c = \mathbf{X} - \mathbf{1}_n \bar{\mathbf{x}}^T\]

where \(\mathbf{1}_n\) is an \(n\)-dimensional vector of ones.
Why center? PCA seeks directions of maximum variance. Without centering, the first PC would simply point toward the mean, which is not informative. Centering ensures we measure variance around the data's center of mass.
Convention: From now on, assume \(\mathbf{X}\) is already centered (i.e., column means are zero).
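As a quick sanity check, centering takes only a couple of lines of numpy (the data values below are hypothetical, chosen just for illustration):

```python
import numpy as np

# Toy data: 4 observations, 3 features (hypothetical values)
X = np.array([[2.0, 4.0, 1.0],
              [3.0, 5.0, 0.0],
              [1.0, 3.0, 2.0],
              [4.0, 6.0, 1.0]])

# Column means
col_means = X.mean(axis=0)

# Centered data: subtract the mean of each column
X_centered = X - col_means

# After centering, every column mean is (numerically) zero
print(np.allclose(X_centered.mean(axis=0), 0.0))  # True
```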
First Principal Component: Maximum Variance¶
Question: What direction \(\mathbf{w}_1 \in \mathbb{R}^p\) maximizes the variance of the projected data?
Projection: Project each observation \(\mathbf{x}_i\) onto direction \(\mathbf{w}_1\):

\[z_{i1} = \mathbf{w}_1^T \mathbf{x}_i\]

Variance of projections (since data is centered, mean is zero):

\[\text{Var}(z_1) = \frac{1}{n} \sum_{i=1}^n z_{i1}^2 = \frac{1}{n} \sum_{i=1}^n (\mathbf{w}_1^T \mathbf{x}_i)^2\]

Using matrix notation:

\[\text{Var}(z_1) = \frac{1}{n} \mathbf{w}_1^T \mathbf{X}^T \mathbf{X} \mathbf{w}_1 = \mathbf{w}_1^T \mathbf{S} \mathbf{w}_1\]

Define the sample covariance matrix:

\[\mathbf{S} = \frac{1}{n} \mathbf{X}^T \mathbf{X}\]
This is a \(p \times p\) symmetric positive semi-definite matrix.
Optimization problem:

\[\max_{\mathbf{w}_1} \ \mathbf{w}_1^T \mathbf{S} \mathbf{w}_1 \quad \text{subject to} \quad \|\mathbf{w}_1\| = 1\]
Why the constraint? Without it, we could make variance arbitrarily large by scaling \(\mathbf{w}_1\). The constraint \(\|\mathbf{w}_1\| = 1\) makes the solution unique (up to sign).
Derivation: First Principal Component¶
Method: Lagrange multipliers.
Lagrangian:

\[\mathcal{L}(\mathbf{w}_1, \lambda) = \mathbf{w}_1^T \mathbf{S} \mathbf{w}_1 - \lambda (\mathbf{w}_1^T \mathbf{w}_1 - 1)\]

First-order condition: Differentiate with respect to \(\mathbf{w}_1\) and set to zero.

Recall: \(\frac{\partial}{\partial \mathbf{w}}(\mathbf{w}^T \mathbf{A} \mathbf{w}) = 2\mathbf{A}\mathbf{w}\) for symmetric \(\mathbf{A}\). This gives:

\[2\mathbf{S}\mathbf{w}_1 - 2\lambda\mathbf{w}_1 = \mathbf{0} \quad \Longrightarrow \quad \mathbf{S}\mathbf{w}_1 = \lambda\mathbf{w}_1\]
This is an eigenvalue equation! \(\mathbf{w}_1\) is an eigenvector of \(\mathbf{S}\) with eigenvalue \(\lambda\).
Which eigenvector? Substitute back into the objective:

\[\mathbf{w}_1^T \mathbf{S} \mathbf{w}_1 = \mathbf{w}_1^T (\lambda \mathbf{w}_1) = \lambda\, \mathbf{w}_1^T \mathbf{w}_1 = \lambda\]
Result: The variance of the first principal component equals the eigenvalue \(\lambda\). To maximize variance, choose the largest eigenvalue \(\lambda_1\).
First Principal Component: \(\mathbf{w}_1\) is the eigenvector of \(\mathbf{S}\) associated with the largest eigenvalue \(\lambda_1\).

The scores (projections) are:

\[\mathbf{z}_1 = \mathbf{X} \mathbf{w}_1 \in \mathbb{R}^n\]
Second Principal Component: Maximum Remaining Variance¶
Question: What direction \(\mathbf{w}_2\) maximizes variance subject to being orthogonal to \(\mathbf{w}_1\)?
Optimization problem:

\[\max_{\mathbf{w}_2} \ \mathbf{w}_2^T \mathbf{S} \mathbf{w}_2 \quad \text{subject to} \quad \|\mathbf{w}_2\| = 1, \quad \mathbf{w}_1^T \mathbf{w}_2 = 0\]

Lagrangian:

\[\mathcal{L}(\mathbf{w}_2, \lambda, \mu) = \mathbf{w}_2^T \mathbf{S} \mathbf{w}_2 - \lambda (\mathbf{w}_2^T \mathbf{w}_2 - 1) - \mu\, \mathbf{w}_1^T \mathbf{w}_2\]

First-order condition:

\[2\mathbf{S}\mathbf{w}_2 - 2\lambda\mathbf{w}_2 - \mu\mathbf{w}_1 = \mathbf{0}\]

Multiply both sides by \(\mathbf{w}_1^T\):

\[2\mathbf{w}_1^T\mathbf{S}\mathbf{w}_2 - 2\lambda\,\mathbf{w}_1^T\mathbf{w}_2 - \mu\,\mathbf{w}_1^T\mathbf{w}_1 = 0\]

Since \(\mathbf{w}_1^T \mathbf{w}_2 = 0\) (orthogonality constraint) and \(\mathbf{w}_1^T \mathbf{w}_1 = 1\):

\[\mu = 2\mathbf{w}_1^T\mathbf{S}\mathbf{w}_2\]
But \(\mathbf{S}\) is symmetric, so \(\mathbf{w}_1^T \mathbf{S} \mathbf{w}_2 = \mathbf{w}_2^T \mathbf{S} \mathbf{w}_1 = \mathbf{w}_2^T (\lambda_1 \mathbf{w}_1) = \lambda_1 \mathbf{w}_2^T \mathbf{w}_1 = 0\).
Therefore: \(\mu = 0\).
Result: The first-order condition simplifies to:

\[\mathbf{S}\mathbf{w}_2 = \lambda\mathbf{w}_2\]
Same eigenvalue equation! But now we want the second-largest eigenvalue \(\lambda_2\).
Second Principal Component: \(\mathbf{w}_2\) is the eigenvector of \(\mathbf{S}\) associated with \(\lambda_2\), with scores \(\mathbf{z}_2 = \mathbf{X}\mathbf{w}_2\).
Beautiful property: If \(\mathbf{S}\) has distinct eigenvalues, its eigenvectors are automatically orthogonal (this is a theorem from linear algebra). So the orthogonality constraint is automatically satisfied.
General Case: All Principal Components¶
Eigendecomposition of covariance matrix:

\[\mathbf{S} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T\]
where:
- \(\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_p]\) is a \(p \times p\) matrix of eigenvectors (principal component directions)
- \(\mathbf{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)\) is a diagonal matrix of eigenvalues
- Eigenvalues are ordered: \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0\)
- Eigenvectors are orthonormal: \(\mathbf{W}^T \mathbf{W} = \mathbf{I}\)
Principal components: The \(j\)-th principal component is:

\[\mathbf{z}_j = \mathbf{X}\mathbf{w}_j, \qquad z_{ij} = \mathbf{x}_i^T \mathbf{w}_j\]

Matrix form: Transform all observations simultaneously:

\[\mathbf{Z} = \mathbf{X}\mathbf{W}\]

where \(\mathbf{Z}\) is the \(n \times p\) matrix of principal component scores.
Variance explained by PC \(j\): \(\text{Var}(z_j) = \lambda_j\)
Total variance:

\[\sum_{j=1}^p \text{Var}(z_j) = \sum_{j=1}^p \lambda_j = \text{tr}(\mathbf{S})\]
This shows that PCA redistributes variance without changing the total variance.
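The pipeline above, centering, eigendecomposition of the covariance matrix, and scores \(\mathbf{Z} = \mathbf{X}\mathbf{W}\), can be sketched in numpy; the random data and its shape here are purely illustrative, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data: 200 observations, 3 features
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])
X = X - X.mean(axis=0)                       # center

S = (X.T @ X) / len(X)                       # sample covariance (1/n convention)
eigvals, eigvecs = np.linalg.eigh(S)         # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]            # sort eigenvalues descending
lam, W = eigvals[order], eigvecs[:, order]

Z = X @ W                                    # principal component scores

# Scores are uncorrelated and Var(z_j) = lambda_j
S_z = (Z.T @ Z) / len(Z)
print(np.allclose(S_z, np.diag(lam)))        # True
# Total variance is preserved: trace(S) = sum of eigenvalues
print(np.allclose(np.trace(S), lam.sum()))   # True
```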
Proportion of variance explained by PC \(j\):

\[\rho_j = \frac{\lambda_j}{\sum_{k=1}^p \lambda_k}\]

Cumulative proportion (first \(d\) components):

\[\sum_{j=1}^d \rho_j = \frac{\sum_{j=1}^d \lambda_j}{\sum_{k=1}^p \lambda_k}\]
Dimensionality Reduction¶
To reduce from \(p\) dimensions to \(d < p\) dimensions:
- Select the first \(d\) principal components: \(\mathbf{W}_d = [\mathbf{w}_1, \ldots, \mathbf{w}_d]\)
- Project: \(\mathbf{Z}_d = \mathbf{X} \mathbf{W}_d\) (dimension \(n \times d\))
How to choose \(d\)?
Method 1: Scree plot — plot eigenvalues and look for an "elbow"
Method 2: Cumulative variance threshold — choose \(d\) such that \(\sum_{j=1}^d \rho_j \geq \alpha\) (e.g., \(\alpha = 0.95\) for 95% variance explained)
Method 3: Kaiser criterion — keep components with \(\lambda_j > \bar{\lambda} = \frac{1}{p}\sum_{k=1}^p \lambda_k\) (only for standardized data)
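Methods 2 and 3 are easy to sketch in numpy; the eigenvalue spectrum below is hypothetical, chosen only to make the two rules give different answers:

```python
import numpy as np

# Hypothetical eigenvalue spectrum from a PCA fit (illustrative numbers)
lam = np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])

rho = lam / lam.sum()        # proportion of variance per PC
cum = np.cumsum(rho)         # cumulative proportion

# Method 2: smallest d whose cumulative proportion reaches 95%
d = int(np.argmax(cum >= 0.95) + 1)
print(d)          # 5

# Method 3 (Kaiser): keep components whose eigenvalue exceeds the mean
kaiser = int((lam > lam.mean()).sum())
print(kaiser)     # 2
```

Note how the two criteria can disagree: the variance threshold keeps 5 components while the Kaiser rule keeps only 2.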
Reconstruction from Principal Components¶
Given the reduced representation \(\mathbf{Z}_d\), we can approximately reconstruct the original data:

\[\hat{\mathbf{X}} = \mathbf{Z}_d \mathbf{W}_d^T = \mathbf{X} \mathbf{W}_d \mathbf{W}_d^T\]

Reconstruction error (mean squared error):

\[\text{MSE} = \frac{1}{n}\|\mathbf{X} - \hat{\mathbf{X}}\|_F^2 = \sum_{j=d+1}^p \lambda_j\]
where \(\|\cdot\|_F\) is the Frobenius norm.
Interpretation: The reconstruction error equals the total variance of the discarded components.
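This identity between reconstruction error and discarded eigenvalues can be checked numerically; the sketch below uses random illustrative data and the \(1/n\) covariance convention from earlier:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

S = (X.T @ X) / len(X)
lam, W = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]

d = 2
W_d = W[:, :d]                 # keep the first d directions
Z_d = X @ W_d                  # reduced scores
X_hat = Z_d @ W_d.T            # reconstruction

# MSE equals the sum of the discarded eigenvalues
mse = np.linalg.norm(X - X_hat, 'fro')**2 / len(X)
print(np.allclose(mse, lam[d:].sum()))   # True
```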
Connection to Singular Value Decomposition (SVD)¶
PCA is intimately connected to SVD, which is often the preferred computational method.
Singular Value Decomposition: Any matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) can be decomposed as:

\[\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T\]

where:

- \(\mathbf{U} \in \mathbb{R}^{n \times n}\): left singular vectors (orthonormal)
- \(\mathbf{D} \in \mathbb{R}^{n \times p}\): diagonal matrix of singular values \(d_1 \geq d_2 \geq \cdots \geq 0\)
- \(\mathbf{V} \in \mathbb{R}^{p \times p}\): right singular vectors (orthonormal)
Connection to PCA: If \(\mathbf{X}\) is centered, then:

\[\mathbf{S} = \frac{1}{n}\mathbf{X}^T\mathbf{X} = \frac{1}{n}\mathbf{V}\mathbf{D}^T\mathbf{D}\mathbf{V}^T\]

Result:

- Principal component directions: \(\mathbf{W} = \mathbf{V}\) (right singular vectors)
- Eigenvalues: \(\lambda_j = d_j^2 / n\) (squared singular values divided by \(n\))
- Principal component scores: \(\mathbf{Z} = \mathbf{X}\mathbf{V} = \mathbf{U}\mathbf{D}\)
Computational advantage: SVD is more numerically stable than computing eigenvalues of \(\mathbf{X}^T\mathbf{X}\) directly, especially when \(\mathbf{X}\) is poorly conditioned.
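The PCA-SVD correspondence is easy to verify numerically. The sketch below uses random illustrative data; eigenvector columns are compared in absolute value, since each principal direction is only determined up to a sign flip:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)               # the SVD route also requires centered data

# Eigendecomposition route
lam, W = np.linalg.eigh((X.T @ X) / len(X))
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]

# SVD route (singular values come back in descending order)
U, dvals, Vt = np.linalg.svd(X, full_matrices=False)

# lambda_j = d_j^2 / n, and V spans the same directions as W
print(np.allclose(lam, dvals**2 / len(X)))   # True
# Columns may differ by sign, so compare absolute values
print(np.allclose(np.abs(W), np.abs(Vt.T)))  # True
```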
Data Standardization: When to Use It¶
Issue: If variables have different scales (e.g., height in cm vs. weight in kg), variables with larger variance will dominate the principal components.
Solution: Standardize (z-score normalize) each variable:

\[\tilde{X}_{ij} = \frac{X_{ij} - \bar{X}_j}{s_j}\]

where \(s_j = \sqrt{\frac{1}{n}\sum_{i=1}^n (X_{ij} - \bar{X}_j)^2}\) is the standard deviation of variable \(j\).
Effect: After standardization, all variables have mean 0 and variance 1. PCA then operates on the correlation matrix instead of the covariance matrix:

\[\mathbf{R} = \frac{1}{n}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\]
When to standardize:
- Variables have different units or scales
- Want each variable to contribute equally (democratic weighting)
When NOT to standardize:
- Variables have the same units and scale
- The scale itself is meaningful (e.g., all measurements in grams)
- Want variance to reflect natural importance
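A short sketch of standardization and its effect; the "height" and "weight" variables are hypothetical, generated at random for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two variables on very different scales (hypothetical: cm and kg)
height = rng.normal(170, 10, size=200)
weight = rng.normal(70, 5, size=200)
X = np.column_stack([height, weight])

# z-score each column (1/n convention for the standard deviation)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance of standardized data = correlation matrix of the original data
R = (X_std.T @ X_std) / len(X_std)
print(np.allclose(np.diag(R), 1.0))                   # True: unit variances
print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # True
```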
Complete Worked Example¶
Problem: Perform PCA on a dataset of student exam scores in 3 subjects.
Data (already centered):
Four students, three subjects.
Step 1: Compute covariance matrix.
Step 2: Compute eigenvalues and eigenvectors.
Characteristic equation: \(\det(\mathbf{S} - \lambda \mathbf{I}) = 0\)
(Detailed computation omitted for brevity. Use numerical software.)
Eigenvalues (ordered):
- \(\lambda_1 \approx 2.12\)
- \(\lambda_2 \approx 0.36\)
- \(\lambda_3 \approx 0.02\)
Eigenvectors (normalized):
Step 3: Compute variance explained.
Total variance: \(\sum \lambda_j = 2.12 + 0.36 + 0.02 = 2.50\)
Proportions:
- PC1: \(\rho_1 = 2.12 / 2.5 = 0.848\) (84.8%)
- PC2: \(\rho_2 = 0.36 / 2.5 = 0.144\) (14.4%)
- PC3: \(\rho_3 = 0.02 / 2.5 = 0.008\) (0.8%)
Cumulative: PC1 + PC2 = 99.2% of variance
Step 4: Reduce to 2 dimensions.
Projected data:
Interpretation: We've reduced from 3 dimensions to 2, retaining 99.2% of the variance. Student 1 has highest score on PC1; students 3 and 4 have lowest scores.
Example Visualization¶
Below are the graphical visualizations of the example data:
Complete Analysis:

The figure shows:

1. Top left: original 3D data (4 students, 3 subjects)
2. Top right: 2D projection after PCA. Note how Student 1 (red) is far from the center while Students 3 and 4 (green and orange) are close together
3. Bottom left: scree plot with eigenvalues. PC1 explains 84.8% of variance
4. Bottom right: bar chart of cumulative variance explained. With 2 components we reach 99.2%
Biplot:

The biplot shows simultaneously:

- Student scores (colored points) in the space of the first 2 PCs
- Loadings (black arrows) indicating how each subject contributes to the principal components
PCA for Visualization¶
One of the most common uses of PCA is data visualization in 2D or 3D.
Process:
- Compute first 2 (or 3) principal components
- Plot observations in PC space
- Color/label points by known groups or variables
Benefits:
- Reveals clusters and outliers
- Shows main patterns of variation
- Interpretable axes (if component loadings are examined)
Example applications:
- Genomics: Visualize population structure from genetic data
- Customer analytics: Segment customers in 2D
- Image data: Visualize image similarity
Loadings and Interpretation¶
The loadings are the elements of the eigenvectors \(\mathbf{w}_j\). They show how each original variable contributes to each PC.
Loading matrix: \(\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_p]\)
Interpretation:

- \(w_{kj}\): contribution of original variable \(k\) to principal component \(j\)
- Large \(|w_{kj}|\): variable \(k\) strongly influences PC \(j\)
- Sign indicates direction of relationship
Biplot: Simultaneously plots observations (PC scores) and variables (loadings) on the same plot. Helps interpret what each PC represents.
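One common way to read loadings is to rank, for each PC, the variables by the magnitude of their loading. Below is a sketch with a hypothetical loading matrix and made-up variable names:

```python
import numpy as np

# Hypothetical loading matrix: 3 variables, 2 retained PCs
W = np.array([[ 0.62,  0.41],
              [ 0.59, -0.77],
              [ 0.52,  0.49]])
variables = ["math", "physics", "literature"]   # hypothetical labels

# For each PC, list variables by strength (absolute value) of contribution
for j in range(W.shape[1]):
    order = np.argsort(-np.abs(W[:, j]))
    ranked = [(variables[k], float(W[k, j])) for k in order]
    print(f"PC{j+1}: {ranked}")
```

Here PC1 would read as a roughly equal mix of all three subjects, while PC2 is dominated by the (negative) physics loading.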
Assumptions and Limitations¶
Assumptions:
- Linear relationships: PCA finds linear combinations; cannot capture nonlinear patterns
- Variance = information: Assumes directions of high variance are most important
- Orthogonality: Components are constrained to be uncorrelated
- Scale sensitivity: Sensitive to variable scaling (consider standardization)
Limitations:
- Interpretability: PCs are linear combinations of all variables; can be hard to interpret
- Outliers: Sensitive to outliers, which can distort principal directions
- Second-order structure only: PCA summarizes the data through its covariance matrix; structure beyond second moments (e.g., higher-order dependencies) is not captured
- Categorical data: PCA is designed for continuous data; alternatives needed for categorical variables
When PCA may fail:
- Data lies on a nonlinear manifold (use kernel PCA or manifold learning instead)
- Important variation is in low-variance directions (e.g., signal vs. noise in certain contexts)
- Features are not correlated (PCA provides no benefit)
Variants and Extensions¶
Kernel PCA¶
For nonlinear dimensionality reduction, apply PCA in a high-dimensional feature space defined by a kernel function.
Kernel trick: Instead of explicitly computing features, use kernel \(k(\mathbf{x}_i, \mathbf{x}_j)\) to implicitly work in feature space.
Common kernels:
- Polynomial: \(k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + c)^d\)
- RBF (Gaussian): \(k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2)\)
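A minimal numpy-only sketch of kernel PCA with the RBF kernel, including the kernel-matrix centering step (the data, `gamma` value, and function name are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def rbf_kernel_pca(X, gamma=1.0, d=2):
    """Kernel PCA sketch with an RBF kernel (numpy only, illustrative)."""
    n = len(X)
    # Pairwise squared distances -> RBF kernel matrix
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-gamma * sq)
    # Center the kernel matrix in feature space
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    # Eigendecomposition of the centered kernel; keep the top-d components
    lam, A = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1][:d]
    lam, A = lam[order], A[:, order]
    # Scores: eigenvectors scaled by the square roots of their eigenvalues
    return A * np.sqrt(np.clip(lam, 0, None))

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
Z = rbf_kernel_pca(X, gamma=0.5, d=2)
print(Z.shape)   # (30, 2)
```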
Sparse PCA¶
Standard PCA produces dense loadings (all variables contribute to each PC). Sparse PCA adds a sparsity penalty to encourage loadings to be exactly zero.
Benefit: Improved interpretability — each PC depends on only a subset of variables.
Method: Add \(L_1\) penalty (lasso) to the PCA optimization.
Robust PCA¶
Standard PCA is sensitive to outliers. Robust PCA decomposes data as:

\[\mathbf{X} = \mathbf{L} + \mathbf{S}\]
where:
- \(\mathbf{L}\): low-rank matrix (clean signal)
- \(\mathbf{S}\): sparse matrix (outliers/noise)
Method: Solve via convex optimization (nuclear norm + \(L_1\) norm).
Probabilistic PCA (PPCA)¶
Treats PCA as a probabilistic latent variable model:

\[\mathbf{x}_i = \mathbf{W}\mathbf{z}_i + \boldsymbol{\mu} + \boldsymbol{\epsilon}_i\]

where \(\mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) (latent variables) and \(\boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\) (noise).
Benefits:
- Principled handling of missing data
- Probabilistic framework allows Bayesian inference
- Connection to factor analysis
Estimation: Expectation-Maximization (EM) algorithm.
PCA vs. Other Methods¶
PCA vs. Factor Analysis (FA)¶
- PCA: Finds directions of maximum variance; deterministic
- FA: Models observed variables as linear functions of latent factors plus noise; probabilistic
When to use PCA: Dimensionality reduction, data compression
When to use FA: Modeling latent constructs (e.g., intelligence, personality traits)
PCA vs. Linear Discriminant Analysis (LDA)¶
- PCA: Unsupervised; maximizes variance
- LDA: Supervised; maximizes class separation
When to use PCA: No labels, want to visualize or compress data
When to use LDA: Have class labels, want to classify or find discriminative features
See Linear-Discriminant-Analysis for details.
PCA vs. t-SNE / UMAP¶
- PCA: Linear, preserves global structure, fast
- t-SNE/UMAP: Nonlinear, preserves local structure, slower
When to use PCA: Large datasets, want global structure, need fast computation
When to use t-SNE/UMAP: Want to visualize clusters in 2D, local structure is important
PCA vs. Autoencoders¶
- PCA: Linear, closed-form solution
- Autoencoders: Nonlinear (with nonlinear activations), learned via optimization
When to use PCA: Simple, interpretable, guaranteed optimal linear solution
When to use Autoencoders: Complex nonlinear relationships, very high dimensions
Variables and Symbols¶
| Symbol | Name | Description |
|---|---|---|
| \(\mathbf{X}\) | Data matrix | \(n \times p\) matrix of observations |
| \(\mathbf{x}_i\) | Observation | \(i\)-th row of \(\mathbf{X}\) (a \(p\)-dimensional vector) |
| \(n\) | Sample size | Number of observations |
| \(p\) | Dimensionality | Number of original variables |
| \(d\) | Reduced dimension | Number of retained principal components |
| \(\mathbf{S}\) | Covariance matrix | \(p \times p\) sample covariance matrix |
| \(\mathbf{w}_j\) | PC direction | \(j\)-th eigenvector (principal component direction) |
| \(\lambda_j\) | Eigenvalue | Variance explained by \(j\)-th principal component |
| \(z_{ij}\) | PC score | Projection of observation \(i\) onto PC \(j\) |
| \(\mathbf{Z}\) | Score matrix | \(n \times p\) matrix of principal component scores |
| \(\mathbf{W}\) | Loading matrix | \(p \times p\) matrix of eigenvectors |
Related Concepts¶
- Covariance Matrix — Foundation for PCA
- Eigenvalues and Eigenvectors — Core mathematical tool
- Linear-Discriminant-Analysis — Supervised analog of PCA
- Singular Value Decomposition — Computational method for PCA
- Variance — What PCA seeks to maximize
- Matrix Operations — Linear algebra foundations
- Gaussian Distribution — Probabilistic PCA assumes this distribution
Historical and Modern References¶
- Pearson, K. (1901). "On lines and planes of closest fit to systems of points in space." Philosophical Magazine, 2(11), 559-572.
- Hotelling, H. (1933). "Analysis of a complex of statistical variables into principal components." Journal of Educational Psychology, 24(6), 417-441.
- Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 14]
- Tipping, M. E., & Bishop, C. M. (1999). "Probabilistic principal component analysis." Journal of the Royal Statistical Society: Series B, 61(3), 611-622.
- Abdi, H., & Williams, L. J. (2010). "Principal component analysis." Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433-459.
- Shlens, J. (2014). "A tutorial on principal component analysis." arXiv preprint arXiv:1404.1100.