Principal Component Analysis (PCA)¶
The Story Behind the Mathematics¶
The story of Principal Component Analysis begins in 1901, in the corridors of University College London, where Karl Pearson (1857-1936), the father of modern statistics, was grappling with a fundamental problem in geometry and data analysis.
Pearson was studying biological measurements — skull dimensions, body proportions, evolutionary traits. He noticed that many measurements were correlated: larger organisms tended to be larger in multiple dimensions simultaneously. This redundancy meant that high-dimensional data could potentially be described more simply.
Pearson's key question: Given points in high-dimensional space, how do we find the "best-fitting" lower-dimensional subspace that captures most of the variation?
In his 1901 paper "On Lines and Planes of Closest Fit to Systems of Points in Space," Pearson introduced what he called the method of principal axes. He showed that the optimal line through a cloud of points was the one that minimized the sum of squared perpendicular distances from points to the line — equivalently, the line that maximized the variance of the projected points.
Pearson's geometric insight: The first principal axis points in the direction of maximum variance. The second principal axis, orthogonal to the first, points in the direction of maximum remaining variance, and so on.
Remarkably, Harold Hotelling (1895-1973), an American statistician, independently rediscovered and extended this method in 1933. Hotelling developed the algebraic framework we use today, connecting the geometric intuition to eigenvalue decomposition of covariance matrices. He coined the term "Principal Components" and showed how to extract them systematically.
Hotelling's contributions:
- Formalized PCA using covariance matrices and eigenanalysis
- Developed statistical tests for significance of components
- Connected PCA to factor analysis and canonical correlation
Historical parallel: Interestingly, PCA was developed around the same time that Einstein was revolutionizing physics with relativity. Both involved finding the "right" coordinate system: Einstein sought coordinates that made physical laws simple; Pearson and Hotelling sought coordinates that made data variation simple.
The computational revolution: For decades, PCA remained computationally expensive, limiting its use to small datasets. The advent of computers in the 1960s-70s, combined with efficient algorithms like the QR algorithm and Singular Value Decomposition (SVD), made PCA practical for large-scale applications.
Modern significance: PCA is now one of the most widely used techniques in data science. Its adoption has broadened decade by decade:
- 1960s-70s: Psychometrics and factor analysis (IQ tests, personality assessments)
- 1980s-90s: Image compression (JPEG precursors), face recognition (eigenfaces)
- 2000s: Genomics (gene expression analysis), finance (portfolio optimization)
- 2010s-present: Machine learning preprocessing, deep learning initialization, data visualization
Though over a century old, PCA remains essential: it's taught in every data science curriculum and used in virtually every field that deals with multivariate data.
Why It Matters¶
Principal Component Analysis is the workhorse of dimensionality reduction and exploratory data analysis. It's used in:
- Machine Learning: Feature extraction, data preprocessing, noise reduction
- Computer Vision: Face recognition (eigenfaces), image compression, object detection
- Genomics: Gene expression analysis, population genetics, variant discovery
- Finance: Portfolio optimization, risk management, factor models
- Neuroscience: Brain imaging analysis (fMRI, EEG), neural population analysis
- Chemistry: Spectroscopy analysis, quantitative structure-activity relationships (QSAR)
- Climate Science: Pattern extraction from spatiotemporal data
- Marketing: Customer segmentation, market research
- Quality Control: Process monitoring, fault detection
- Natural Language Processing: Latent semantic analysis, document clustering
- Psychology: Factor analysis, psychometric testing
- Physics: Data compression in high-energy physics experiments
PCA provides both interpretable low-dimensional representations and efficient compression, making it invaluable for understanding and working with high-dimensional data.
Prerequisites¶
- Linear algebra (Matrix Operations, Eigenvalues and Eigenvectors)
- Variance and Covariance
- Expected Value and basic statistics
- Matrix decompositions (especially eigenvalue decomposition)
- Understanding of orthogonality and projections
Fundamental Concepts¶
We'll build PCA from first principles, starting with the geometric intuition and progressing to the full mathematical framework.
The Dimensionality Reduction Problem¶
We have:
- Data matrix: \(\mathbf{X} \in \mathbb{R}^{n \times p}\)
- \(n\) observations (rows)
- \(p\) features/variables (columns)
- \(\mathbf{x}_i \in \mathbb{R}^p\): the \(i\)-th observation (row vector)
Goal: Find a lower-dimensional representation \(\mathbf{Z} \in \mathbb{R}^{n \times d}\) where \(d < p\), such that:
- Maximum variance is preserved: capture as much variation as possible
- Minimum information loss: enable approximate reconstruction of original data
- Orthogonal components: new dimensions are uncorrelated
Key insight: We want to find a new coordinate system (basis) where the data has special properties: maximum variance along the first axis, maximum remaining variance along the second axis (perpendicular to the first), etc.
Centering the Data¶
First step: Center the data by subtracting the mean.
Column means:

\[\bar{\mathbf{x}} = \frac{1}{n} \mathbf{X}^T \mathbf{1}_n, \qquad \bar{x}_j = \frac{1}{n} \sum_{i=1}^n X_{ij}\]

Centered data matrix:

\[\mathbf{X}_c = \mathbf{X} - \mathbf{1}_n \bar{\mathbf{x}}^T\]

where \(\mathbf{1}_n\) is an \(n\)-dimensional vector of ones.
Why center? PCA seeks directions of maximum variance. Without centering, the first PC would simply point toward the mean, which is not informative. Centering ensures we measure variance around the data's center of mass.
Convention: From now on, assume \(\mathbf{X}\) is already centered (i.e., column means are zero).
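As a quick sanity check, centering takes only a couple of lines of numpy (the data values below are hypothetical, chosen just for illustration):

```python
import numpy as np

# Toy data: 4 observations, 3 features (hypothetical values)
X = np.array([[2.0, 4.0, 1.0],
              [3.0, 5.0, 0.0],
              [1.0, 3.0, 2.0],
              [4.0, 6.0, 1.0]])

# Column means
col_means = X.mean(axis=0)

# Centered data: subtract the mean of each column
X_centered = X - col_means

# After centering, every column mean is (numerically) zero
print(np.allclose(X_centered.mean(axis=0), 0.0))  # True
```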
First Principal Component: Maximum Variance¶
Question: What direction \(\mathbf{w}_1 \in \mathbb{R}^p\) maximizes the variance of the projected data?
Projection: Project each observation \(\mathbf{x}_i\) onto direction \(\mathbf{w}_1\):

\[z_{i1} = \mathbf{w}_1^T \mathbf{x}_i\]

Variance of projections (since data is centered, mean is zero):

\[\text{Var}(z_1) = \frac{1}{n} \sum_{i=1}^n z_{i1}^2 = \frac{1}{n} \sum_{i=1}^n (\mathbf{w}_1^T \mathbf{x}_i)^2\]

Using matrix notation:

\[\text{Var}(z_1) = \frac{1}{n} \mathbf{w}_1^T \mathbf{X}^T \mathbf{X} \mathbf{w}_1 = \mathbf{w}_1^T \mathbf{S} \mathbf{w}_1\]

Define the sample covariance matrix:

\[\mathbf{S} = \frac{1}{n} \mathbf{X}^T \mathbf{X}\]
This is a \(p \times p\) symmetric positive semi-definite matrix.
Optimization problem:

\[\max_{\mathbf{w}_1} \ \mathbf{w}_1^T \mathbf{S} \mathbf{w}_1 \quad \text{subject to} \quad \|\mathbf{w}_1\| = 1\]
Why the constraint? Without it, we could make variance arbitrarily large by scaling \(\mathbf{w}_1\). The constraint \(\|\mathbf{w}_1\| = 1\) makes the solution unique (up to sign).
Derivation: First Principal Component¶
Method: Lagrange multipliers.
Lagrangian:

\[\mathcal{L}(\mathbf{w}_1, \lambda) = \mathbf{w}_1^T \mathbf{S} \mathbf{w}_1 - \lambda (\mathbf{w}_1^T \mathbf{w}_1 - 1)\]

First-order condition: Differentiate with respect to \(\mathbf{w}_1\) and set to zero.

Recall: \(\frac{\partial}{\partial \mathbf{w}}(\mathbf{w}^T \mathbf{A} \mathbf{w}) = 2\mathbf{A}\mathbf{w}\) for symmetric \(\mathbf{A}\). This gives:

\[2\mathbf{S}\mathbf{w}_1 - 2\lambda\mathbf{w}_1 = \mathbf{0} \quad \Longrightarrow \quad \mathbf{S}\mathbf{w}_1 = \lambda\mathbf{w}_1\]
This is an eigenvalue equation! \(\mathbf{w}_1\) is an eigenvector of \(\mathbf{S}\) with eigenvalue \(\lambda\).
Which eigenvector? Substitute back into the objective:

\[\mathbf{w}_1^T \mathbf{S} \mathbf{w}_1 = \mathbf{w}_1^T (\lambda \mathbf{w}_1) = \lambda\, \mathbf{w}_1^T \mathbf{w}_1 = \lambda\]
Result: The variance of the first principal component equals the eigenvalue \(\lambda\). To maximize variance, choose the largest eigenvalue \(\lambda_1\).
First Principal Component: \(\mathbf{w}_1\) is the eigenvector of \(\mathbf{S}\) associated with the largest eigenvalue \(\lambda_1\).

The scores (projections) are:

\[\mathbf{z}_1 = \mathbf{X} \mathbf{w}_1 \in \mathbb{R}^n\]
Second Principal Component: Maximum Remaining Variance¶
Question: What direction \(\mathbf{w}_2\) maximizes variance subject to being orthogonal to \(\mathbf{w}_1\)?
Optimization problem:

\[\max_{\mathbf{w}_2} \ \mathbf{w}_2^T \mathbf{S} \mathbf{w}_2 \quad \text{subject to} \quad \|\mathbf{w}_2\| = 1, \quad \mathbf{w}_1^T \mathbf{w}_2 = 0\]

Lagrangian:

\[\mathcal{L}(\mathbf{w}_2, \lambda, \mu) = \mathbf{w}_2^T \mathbf{S} \mathbf{w}_2 - \lambda (\mathbf{w}_2^T \mathbf{w}_2 - 1) - \mu\, \mathbf{w}_1^T \mathbf{w}_2\]

First-order condition:

\[2\mathbf{S}\mathbf{w}_2 - 2\lambda\mathbf{w}_2 - \mu\mathbf{w}_1 = \mathbf{0}\]

Multiply both sides by \(\mathbf{w}_1^T\):

\[2\mathbf{w}_1^T\mathbf{S}\mathbf{w}_2 - 2\lambda\,\mathbf{w}_1^T\mathbf{w}_2 - \mu\,\mathbf{w}_1^T\mathbf{w}_1 = 0\]

Since \(\mathbf{w}_1^T \mathbf{w}_2 = 0\) (orthogonality constraint) and \(\mathbf{w}_1^T \mathbf{w}_1 = 1\):

\[\mu = 2\mathbf{w}_1^T\mathbf{S}\mathbf{w}_2\]
But \(\mathbf{S}\) is symmetric, so \(\mathbf{w}_1^T \mathbf{S} \mathbf{w}_2 = \mathbf{w}_2^T \mathbf{S} \mathbf{w}_1 = \mathbf{w}_2^T (\lambda_1 \mathbf{w}_1) = \lambda_1 \mathbf{w}_2^T \mathbf{w}_1 = 0\).
Therefore: \(\mu = 0\).
Result: The first-order condition simplifies to:

\[\mathbf{S}\mathbf{w}_2 = \lambda\mathbf{w}_2\]
Same eigenvalue equation! But now we want the second-largest eigenvalue \(\lambda_2\).
Second Principal Component: \(\mathbf{w}_2\) is the eigenvector of \(\mathbf{S}\) associated with \(\lambda_2\), with scores \(\mathbf{z}_2 = \mathbf{X}\mathbf{w}_2\).
Beautiful property: If \(\mathbf{S}\) has distinct eigenvalues, its eigenvectors are automatically orthogonal (this is a theorem from linear algebra). So the orthogonality constraint is automatically satisfied.
General Case: All Principal Components¶
Eigendecomposition of covariance matrix:

\[\mathbf{S} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T\]
where:
- \(\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_p]\) is a \(p \times p\) matrix of eigenvectors (principal component directions)
- \(\mathbf{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)\) is a diagonal matrix of eigenvalues
- Eigenvalues are ordered: \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0\)
- Eigenvectors are orthonormal: \(\mathbf{W}^T \mathbf{W} = \mathbf{I}\)
Principal components: The \(j\)-th principal component is:

\[\mathbf{z}_j = \mathbf{X}\mathbf{w}_j, \qquad z_{ij} = \mathbf{x}_i^T \mathbf{w}_j\]

Matrix form: Transform all observations simultaneously:

\[\mathbf{Z} = \mathbf{X}\mathbf{W}\]

where \(\mathbf{Z}\) is the \(n \times p\) matrix of principal component scores.
Variance explained by PC \(j\): \(\text{Var}(z_j) = \lambda_j\)
Total variance:

\[\sum_{j=1}^p \text{Var}(z_j) = \sum_{j=1}^p \lambda_j = \text{tr}(\mathbf{S})\]
This shows that PCA redistributes variance without changing the total variance.
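The pipeline above, centering, eigendecomposition of the covariance matrix, and scores \(\mathbf{Z} = \mathbf{X}\mathbf{W}\), can be sketched in numpy; the random data and its shape here are purely illustrative, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data: 200 observations, 3 features
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])
X = X - X.mean(axis=0)                       # center

S = (X.T @ X) / len(X)                       # sample covariance (1/n convention)
eigvals, eigvecs = np.linalg.eigh(S)         # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]            # sort eigenvalues descending
lam, W = eigvals[order], eigvecs[:, order]

Z = X @ W                                    # principal component scores

# Scores are uncorrelated and Var(z_j) = lambda_j
S_z = (Z.T @ Z) / len(Z)
print(np.allclose(S_z, np.diag(lam)))        # True
# Total variance is preserved: trace(S) = sum of eigenvalues
print(np.allclose(np.trace(S), lam.sum()))   # True
```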
Proportion of variance explained by PC \(j\):

\[\rho_j = \frac{\lambda_j}{\sum_{k=1}^p \lambda_k}\]

Cumulative proportion (first \(d\) components):

\[\sum_{j=1}^d \rho_j = \frac{\sum_{j=1}^d \lambda_j}{\sum_{k=1}^p \lambda_k}\]
Dimensionality Reduction¶
To reduce from \(p\) dimensions to \(d < p\) dimensions:
- Select the first \(d\) principal components: \(\mathbf{W}_d = [\mathbf{w}_1, \ldots, \mathbf{w}_d]\)
- Project: \(\mathbf{Z}_d = \mathbf{X} \mathbf{W}_d\) (dimension \(n \times d\))
How to choose \(d\)?
Method 1: Scree plot — plot eigenvalues and look for an "elbow"
Method 2: Cumulative variance threshold — choose \(d\) such that \(\sum_{j=1}^d \rho_j \geq \alpha\) (e.g., \(\alpha = 0.95\) for 95% variance explained)
Method 3: Kaiser criterion — keep components with \(\lambda_j > \bar{\lambda} = \frac{1}{p}\sum_{k=1}^p \lambda_k\) (only for standardized data)
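Methods 2 and 3 are easy to sketch in numpy; the eigenvalue spectrum below is hypothetical, chosen only to make the two rules give different answers:

```python
import numpy as np

# Hypothetical eigenvalue spectrum from a PCA fit (illustrative numbers)
lam = np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])

rho = lam / lam.sum()        # proportion of variance per PC
cum = np.cumsum(rho)         # cumulative proportion

# Method 2: smallest d whose cumulative proportion reaches 95%
d = int(np.argmax(cum >= 0.95) + 1)
print(d)          # 5

# Method 3 (Kaiser): keep components whose eigenvalue exceeds the mean
kaiser = int((lam > lam.mean()).sum())
print(kaiser)     # 2
```

Note how the two criteria can disagree: the variance threshold keeps 5 components while the Kaiser rule keeps only 2.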
Reconstruction from Principal Components¶
Given the reduced representation \(\mathbf{Z}_d\), we can approximately reconstruct the original data:

\[\hat{\mathbf{X}} = \mathbf{Z}_d \mathbf{W}_d^T = \mathbf{X} \mathbf{W}_d \mathbf{W}_d^T\]

Reconstruction error (mean squared error):

\[\text{MSE} = \frac{1}{n}\|\mathbf{X} - \hat{\mathbf{X}}\|_F^2 = \sum_{j=d+1}^p \lambda_j\]
where \(\|\cdot\|_F\) is the Frobenius norm.
Interpretation: The reconstruction error equals the total variance of the discarded components.
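This identity between reconstruction error and discarded eigenvalues can be checked numerically; the sketch below uses random illustrative data and the \(1/n\) covariance convention from earlier:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

S = (X.T @ X) / len(X)
lam, W = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]

d = 2
W_d = W[:, :d]                 # keep the first d directions
Z_d = X @ W_d                  # reduced scores
X_hat = Z_d @ W_d.T            # reconstruction

# MSE equals the sum of the discarded eigenvalues
mse = np.linalg.norm(X - X_hat, 'fro')**2 / len(X)
print(np.allclose(mse, lam[d:].sum()))   # True
```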
Connection to Singular Value Decomposition (SVD)¶
PCA is intimately connected to SVD, which is often the preferred computational method.
Singular Value Decomposition: Any matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) can be decomposed as:

\[\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T\]

where:

- \(\mathbf{U} \in \mathbb{R}^{n \times n}\): left singular vectors (orthonormal)
- \(\mathbf{D} \in \mathbb{R}^{n \times p}\): diagonal matrix of singular values \(d_1 \geq d_2 \geq \cdots \geq 0\)
- \(\mathbf{V} \in \mathbb{R}^{p \times p}\): right singular vectors (orthonormal)
Connection to PCA: If \(\mathbf{X}\) is centered, then:

\[\mathbf{S} = \frac{1}{n}\mathbf{X}^T\mathbf{X} = \frac{1}{n}\mathbf{V}\mathbf{D}^T\mathbf{D}\mathbf{V}^T\]

Result:

- Principal component directions: \(\mathbf{W} = \mathbf{V}\) (right singular vectors)
- Eigenvalues: \(\lambda_j = d_j^2 / n\) (squared singular values divided by \(n\))
- Principal component scores: \(\mathbf{Z} = \mathbf{X}\mathbf{V} = \mathbf{U}\mathbf{D}\)
Computational advantage: SVD is more numerically stable than computing eigenvalues of \(\mathbf{X}^T\mathbf{X}\) directly, especially when \(\mathbf{X}\) is poorly conditioned.
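The PCA-SVD correspondence is easy to verify numerically. The sketch below uses random illustrative data; eigenvector columns are compared in absolute value, since each principal direction is only determined up to a sign flip:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)               # the SVD route also requires centered data

# Eigendecomposition route
lam, W = np.linalg.eigh((X.T @ X) / len(X))
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]

# SVD route (singular values come back in descending order)
U, dvals, Vt = np.linalg.svd(X, full_matrices=False)

# lambda_j = d_j^2 / n, and V spans the same directions as W
print(np.allclose(lam, dvals**2 / len(X)))   # True
# Columns may differ by sign, so compare absolute values
print(np.allclose(np.abs(W), np.abs(Vt.T)))  # True
```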
Data Standardization: When to Use It¶
Issue: If variables have different scales (e.g., height in cm vs. weight in kg), variables with larger variance will dominate the principal components.
Solution: Standardize (z-score normalize) each variable:

\[\tilde{X}_{ij} = \frac{X_{ij} - \bar{X}_j}{s_j}\]

where \(s_j = \sqrt{\frac{1}{n}\sum_{i=1}^n (X_{ij} - \bar{X}_j)^2}\) is the standard deviation of variable \(j\).
Effect: After standardization, all variables have mean 0 and variance 1. PCA then operates on the correlation matrix instead of the covariance matrix:

\[\mathbf{R} = \frac{1}{n}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\]
When to standardize:
- Variables have different units or scales
- Want each variable to contribute equally (democratic weighting)
When NOT to standardize:
- Variables have the same units and scale
- The scale itself is meaningful (e.g., all measurements in grams)
- Want variance to reflect natural importance
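A short sketch of standardization and its effect; the "height" and "weight" variables are hypothetical, generated at random for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two variables on very different scales (hypothetical: cm and kg)
height = rng.normal(170, 10, size=200)
weight = rng.normal(70, 5, size=200)
X = np.column_stack([height, weight])

# z-score each column (1/n convention for the standard deviation)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance of standardized data = correlation matrix of the original data
R = (X_std.T @ X_std) / len(X_std)
print(np.allclose(np.diag(R), 1.0))                   # True: unit variances
print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # True
```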
Complete Worked Example¶
Problem: Perform PCA on a dataset of student exam scores in 3 subjects.
Data (already centered):
Four students, three subjects.
Step 1: Compute covariance matrix.
Step 2: Compute eigenvalues and eigenvectors.
Characteristic equation: \(\det(\mathbf{S} - \lambda \mathbf{I}) = 0\)
(Detailed computation omitted for brevity. Use numerical software.)
Eigenvalues (ordered):
- \(\lambda_1 \approx 2.12\)
- \(\lambda_2 \approx 0.36\)
- \(\lambda_3 \approx 0.02\)
Eigenvectors (normalized):
Step 3: Compute variance explained.
Total variance: \(\sum \lambda_j = 2.12 + 0.36 + 0.02 = 2.50\)
Proportions:
- PC1: \(\rho_1 = 2.12 / 2.5 = 0.848\) (84.8%)
- PC2: \(\rho_2 = 0.36 / 2.5 = 0.144\) (14.4%)
- PC3: \(\rho_3 = 0.02 / 2.5 = 0.008\) (0.8%)
Cumulative: PC1 + PC2 = 99.2% of variance
Step 4: Reduce to 2 dimensions.
Projected data:
Interpretation: We've reduced from 3 dimensions to 2, retaining 99.2% of the variance. Student 1 has highest score on PC1; students 3 and 4 have lowest scores.
Example Visualization¶
Below are the graphical visualizations of the example data:
Complete Analysis:

The figure shows:

1. Top left: original 3D data (4 students, 3 subjects)
2. Top right: 2D projection after PCA. Note how Student 1 (red) is far from the center while Students 3 and 4 (green and orange) are close together
3. Bottom left: scree plot with eigenvalues. PC1 explains 84.8% of variance
4. Bottom right: bar chart of cumulative variance explained. With 2 components we reach 99.2%
Biplot:

The biplot shows simultaneously:

- Student scores (colored points) in the space of the first 2 PCs
- Loadings (black arrows) indicating how each subject contributes to the principal components
PCA for Visualization¶
One of the most common uses of PCA is data visualization in 2D or 3D.
Process:
- Compute first 2 (or 3) principal components
- Plot observations in PC space
- Color/label points by known groups or variables
Benefits:
- Reveals clusters and outliers
- Shows main patterns of variation
- Interpretable axes (if component loadings are examined)
Example applications:
- Genomics: Visualize population structure from genetic data
- Customer analytics: Segment customers in 2D
- Image data: Visualize image similarity
Loadings and Interpretation¶
The loadings are the elements of the eigenvectors \(\mathbf{w}_j\). They show how each original variable contributes to each PC.
Loading matrix: \(\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_p]\)
Interpretation:

- \(w_{kj}\): contribution of original variable \(k\) to principal component \(j\)
- Large \(|w_{kj}|\): variable \(k\) strongly influences PC \(j\)
- Sign indicates direction of relationship
Biplot: Simultaneously plots observations (PC scores) and variables (loadings) on the same plot. Helps interpret what each PC represents.
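One common way to read loadings is to rank, for each PC, the variables by the magnitude of their loading. Below is a sketch with a hypothetical loading matrix and made-up variable names:

```python
import numpy as np

# Hypothetical loading matrix: 3 variables, 2 retained PCs
W = np.array([[ 0.62,  0.41],
              [ 0.59, -0.77],
              [ 0.52,  0.49]])
variables = ["math", "physics", "literature"]   # hypothetical labels

# For each PC, list variables by strength (absolute value) of contribution
for j in range(W.shape[1]):
    order = np.argsort(-np.abs(W[:, j]))
    ranked = [(variables[k], float(W[k, j])) for k in order]
    print(f"PC{j+1}: {ranked}")
```

Here PC1 would read as a roughly equal mix of all three subjects, while PC2 is dominated by the (negative) physics loading.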
Assumptions and Limitations¶
Assumptions:
- Linear relationships: PCA finds linear combinations; cannot capture nonlinear patterns
- Variance = information: Assumes directions of high variance are most important
- Orthogonality: Components are constrained to be uncorrelated
- Scale sensitivity: Sensitive to variable scaling (consider standardization)
Limitations:
- Interpretability: PCs are linear combinations of all variables; can be hard to interpret
- Outliers: Sensitive to outliers, which can distort principal directions
- Second-order structure only: PCA summarizes the data through its covariance matrix; structure beyond second moments (e.g., higher-order dependencies) is not captured
- Categorical data: PCA is designed for continuous data; alternatives needed for categorical variables
When PCA may fail:
- Data lies on a nonlinear manifold (use kernel PCA or manifold learning instead)
- Important variation is in low-variance directions (e.g., signal vs. noise in certain contexts)
- Features are not correlated (PCA provides no benefit)
Variants and Extensions¶
Kernel PCA¶
For nonlinear dimensionality reduction, apply PCA in a high-dimensional feature space defined by a kernel function.
Kernel trick: Instead of explicitly computing features, use kernel \(k(\mathbf{x}_i, \mathbf{x}_j)\) to implicitly work in feature space.
Common kernels:
- Polynomial: \(k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + c)^d\)
- RBF (Gaussian): \(k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2)\)
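A minimal numpy-only sketch of kernel PCA with the RBF kernel, including the kernel-matrix centering step (the data, `gamma` value, and function name are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def rbf_kernel_pca(X, gamma=1.0, d=2):
    """Kernel PCA sketch with an RBF kernel (numpy only, illustrative)."""
    n = len(X)
    # Pairwise squared distances -> RBF kernel matrix
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-gamma * sq)
    # Center the kernel matrix in feature space
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    # Eigendecomposition of the centered kernel; keep the top-d components
    lam, A = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1][:d]
    lam, A = lam[order], A[:, order]
    # Scores: eigenvectors scaled by the square roots of their eigenvalues
    return A * np.sqrt(np.clip(lam, 0, None))

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
Z = rbf_kernel_pca(X, gamma=0.5, d=2)
print(Z.shape)   # (30, 2)
```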
Sparse PCA¶
Standard PCA produces dense loadings (all variables contribute to each PC). Sparse PCA adds a sparsity penalty to encourage loadings to be exactly zero.
Benefit: Improved interpretability — each PC depends on only a subset of variables.
Method: Add \(L_1\) penalty (lasso) to the PCA optimization.
Robust PCA¶
Standard PCA is sensitive to outliers. Robust PCA decomposes data as:

\[\mathbf{X} = \mathbf{L} + \mathbf{S}\]
where:
- \(\mathbf{L}\): low-rank matrix (clean signal)
- \(\mathbf{S}\): sparse matrix (outliers/noise)
Method: Solve via convex optimization (nuclear norm + \(L_1\) norm).
Probabilistic PCA (PPCA)¶
Treats PCA as a probabilistic latent variable model:

\[\mathbf{x}_i = \mathbf{W}\mathbf{z}_i + \boldsymbol{\mu} + \boldsymbol{\epsilon}_i\]

where \(\mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) (latent variables) and \(\boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\) (noise).
Benefits:
- Principled handling of missing data
- Probabilistic framework allows Bayesian inference
- Connection to factor analysis
Estimation: Expectation-Maximization (EM) algorithm.
PCA vs. Other Methods¶
PCA vs. Factor Analysis (FA)¶
- PCA: Finds directions of maximum variance; deterministic
- FA: Models observed variables as linear functions of latent factors plus noise; probabilistic
When to use PCA: Dimensionality reduction, data compression
When to use FA: Modeling latent constructs (e.g., intelligence, personality traits)
PCA vs. Linear Discriminant Analysis (LDA)¶
- PCA: Unsupervised; maximizes variance
- LDA: Supervised; maximizes class separation
When to use PCA: No labels, want to visualize or compress data
When to use LDA: Have class labels, want to classify or find discriminative features
See Linear-Discriminant-Analysis for details.
PCA vs. t-SNE / UMAP¶
- PCA: Linear, preserves global structure, fast
- t-SNE/UMAP: Nonlinear, preserves local structure, slower
When to use PCA: Large datasets, want global structure, need fast computation
When to use t-SNE/UMAP: Want to visualize clusters in 2D, local structure is important
PCA vs. Autoencoders¶
- PCA: Linear, closed-form solution
- Autoencoders: Nonlinear (with nonlinear activations), learned via optimization
When to use PCA: Simple, interpretable, guaranteed optimal linear solution
When to use Autoencoders: Complex nonlinear relationships, very high dimensions
Variables and Symbols¶
| Symbol | Name | Description |
|---|---|---|
| \(\mathbf{X}\) | Data matrix | \(n \times p\) matrix of observations |
| \(\mathbf{x}_i\) | Observation | \(i\)-th row of \(\mathbf{X}\) (a \(p\)-dimensional vector) |
| \(n\) | Sample size | Number of observations |
| \(p\) | Dimensionality | Number of original variables |
| \(d\) | Reduced dimension | Number of retained principal components |
| \(\mathbf{S}\) | Covariance matrix | \(p \times p\) sample covariance matrix |
| \(\mathbf{w}_j\) | PC direction | \(j\)-th eigenvector (principal component direction) |
| \(\lambda_j\) | Eigenvalue | Variance explained by \(j\)-th principal component |
| \(z_{ij}\) | PC score | Projection of observation \(i\) onto PC \(j\) |
| \(\mathbf{Z}\) | Score matrix | \(n \times p\) matrix of principal component scores |
| \(\mathbf{W}\) | Loading matrix | \(p \times p\) matrix of eigenvectors |
Related Concepts¶
- Covariance Matrix — Foundation for PCA
- Eigenvalues and Eigenvectors — Core mathematical tool
- Linear-Discriminant-Analysis — Supervised analog of PCA
- Singular Value Decomposition — Computational method for PCA
- Variance — What PCA seeks to maximize
- Matrix Operations — Linear algebra foundations
- Gaussian Distribution — Probabilistic PCA assumes this distribution
Historical and Modern References¶
- Pearson, K. (1901). "On lines and planes of closest fit to systems of points in space." Philosophical Magazine, 2(11), 559-572.
- Hotelling, H. (1933). "Analysis of a complex of statistical variables into principal components." Journal of Educational Psychology, 24(6), 417-441.
- Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 14]
- Tipping, M. E., & Bishop, C. M. (1999). "Probabilistic principal component analysis." Journal of the Royal Statistical Society: Series B, 61(3), 611-622.
- Abdi, H., & Williams, L. J. (2010). "Principal component analysis." Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433-459.
- Shlens, J. (2014). "A tutorial on principal component analysis." arXiv preprint arXiv:1404.1100.