Linear Discriminant Analysis (LDA)

The Story Behind the Mathematics

The story of Linear Discriminant Analysis begins in 1936, in the unlikely setting of agricultural research. Ronald Aylmer Fisher (1890-1962), the British statistician and geneticist we've encountered before, was working at the Rothamsted Experimental Station near London.

Fisher faced a practical taxonomic problem: given measurements of iris flowers (sepal length, sepal width, petal length, petal width), could he develop a mathematical method to classify them into different species (setosa, versicolor, virginica)?

This wasn't just botanical curiosity. The broader question was fundamental to all sciences: How can we use multiple measurements to systematically distinguish between groups?

In his 1936 paper "The Use of Multiple Measurements in Taxonomic Problems," Fisher introduced what would become one of the most influential techniques in statistics and machine learning: Linear Discriminant Analysis.

Fisher's key insights:

  1. Dimensionality reduction: Instead of working with all measurements separately, find a single linear combination that best separates the groups
  2. Maximize separation: Choose the combination that makes the groups as far apart as possible relative to their internal variation
  3. Optimal projection: Project high-dimensional data onto a lower-dimensional space while preserving class separation

Fisher showed that this optimal direction could be found by solving an eigenvalue problem — a remarkable connection between statistics and linear algebra.

Historical context: Fisher developed LDA in parallel with his work on Likelihood-Based Statistics and ANOVA. He was deeply interested in finding optimal procedures — methods that extracted maximum information from data. LDA was optimal in a precise sense: it maximized the ratio of between-class to within-class variance.

Beyond botany: While Fisher's original application was classifying iris flowers, LDA quickly spread to other fields:

  • 1940s-50s: Face recognition and character recognition (early computer vision)
  • 1960s-70s: Medical diagnosis, discriminating between disease states
  • 1980s-90s: Finance (credit scoring), marketing (customer segmentation)
  • 2000s-present: Machine learning, genomics, neuroimaging

Modern significance: Though developed nearly 90 years ago, LDA remains widely used today. It's valued for its mathematical elegance, computational efficiency, and strong performance when its assumptions hold. It's also a foundational concept: understanding LDA is essential for grasping more advanced methods like Quadratic Discriminant Analysis (QDA), Regularized Discriminant Analysis (RDA), and even modern deep learning architectures.

Why It Matters

Linear Discriminant Analysis is a fundamental technique for supervised classification and dimensionality reduction. It's used in:

  • Medicine: Disease diagnosis from biomarkers, patient risk stratification
  • Biology: Species classification, genomic data analysis
  • Computer Vision: Face recognition, object detection (especially early methods)
  • Finance: Credit scoring, bankruptcy prediction, fraud detection
  • Marketing: Customer segmentation, churn prediction
  • Speech Recognition: Phoneme classification
  • Chemometrics: Chemical compound identification from spectroscopy
  • Neuroimaging: Brain state classification from fMRI/EEG data
  • Quality Control: Product defect classification
  • Biometrics: Fingerprint and iris recognition

LDA provides both a classifier (assigns new observations to classes) and a feature extractor (finds informative low-dimensional representations), making it versatile and interpretable.

Fundamental Concepts

We'll build LDA from first principles, starting with the simplest case and generalizing.

The Classification Problem

We have:

  • Training data: \(n\) observations \(\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}\)
  • \(\mathbf{x}_i \in \mathbb{R}^p\): feature vector (e.g., \(p\) measurements)
  • \(y_i \in \{1, 2, \ldots, K\}\): class label (\(K\) classes)

Goal: Given a new observation \(\mathbf{x}_{\text{new}}\), predict its class \(y_{\text{new}}\).

LDA approach: Assume each class has a multivariate Gaussian distribution with:

  • Different means \(\boldsymbol{\mu}_k\) for each class \(k\)
  • Shared covariance matrix \(\boldsymbol{\Sigma}\) (same for all classes)

This assumption is what makes the discriminant linear — when all classes share the same covariance, the quadratic terms in the Gaussian densities cancel out, leaving only linear terms (we'll derive this shortly).

Two-Class LDA: The Simplest Case

Let's start with \(K = 2\) classes to build intuition.

Notation:

  • Class 1: \(n_1\) observations with mean \(\boldsymbol{\mu}_1\)
  • Class 2: \(n_2\) observations with mean \(\boldsymbol{\mu}_2\)
  • Common covariance: \(\boldsymbol{\Sigma}\)

Question: We want to project the \(p\)-dimensional data onto a 1-dimensional line (direction \(\mathbf{w}\)). What direction \(\mathbf{w}\) best separates the two classes?

Projection: The projection of \(\mathbf{x}\) onto \(\mathbf{w}\) is the scalar

\[ z = \mathbf{w}^T \mathbf{x} \]

After projection:

  • Class 1 mean: \(m_1 = \mathbf{w}^T \boldsymbol{\mu}_1\)
  • Class 2 mean: \(m_2 = \mathbf{w}^T \boldsymbol{\mu}_2\)
  • Projected variance of each class: \(s^2 = \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w}\) (identical for both classes, by the shared-covariance assumption)

Fisher's idea: We want the projected means to be far apart and the projected variances to be small.

Formally: Maximize the ratio

\[ J(\mathbf{w}) = \frac{(\text{distance between projected means})^2}{\text{sum of projected variances}} \]

Fisher's Linear Discriminant: Derivation

Between-class scatter (separation of means):

\[ \text{Between} = (m_1 - m_2)^2 = (\mathbf{w}^T \boldsymbol{\mu}_1 - \mathbf{w}^T \boldsymbol{\mu}_2)^2 = (\mathbf{w}^T (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2))^2 \]

Using \(a^2 = a^T a\) for scalars:

\[ \text{Between} = \mathbf{w}^T (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{w} \]

Define between-class scatter matrix:

\[ \mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \]

So: \(\text{Between} = \mathbf{w}^T \mathbf{S}_B \mathbf{w}\)

Within-class scatter (variation within each class):

Assuming equal covariance \(\boldsymbol{\Sigma}\) for both classes:

\[ \text{Within} = \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w} + \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w} = 2 \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w} \]

Define within-class scatter matrix (pooled covariance):

\[ \mathbf{S}_W = \boldsymbol{\Sigma} \]

(In practice, we estimate \(\mathbf{S}_W\) from data; we'll derive this later.)

Fisher's criterion:

\[ J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}} \]

Optimization: Find \(\mathbf{w}\) that maximizes \(J(\mathbf{w})\).

Key observation: Since \(J(\mathbf{w}) = J(c\mathbf{w})\) for any constant \(c > 0\), we can impose the constraint \(\mathbf{w}^T \mathbf{S}_W \mathbf{w} = 1\).

Then: Maximize \(\mathbf{w}^T \mathbf{S}_B \mathbf{w}\) subject to \(\mathbf{w}^T \mathbf{S}_W \mathbf{w} = 1\).

Lagrangian:

\[ \mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T \mathbf{S}_B \mathbf{w} - \lambda(\mathbf{w}^T \mathbf{S}_W \mathbf{w} - 1) \]

First-order condition: Differentiate with respect to \(\mathbf{w}\) and set to zero.

Recall that for symmetric \(\mathbf{A}\), \(\frac{\partial}{\partial \mathbf{w}} (\mathbf{w}^T \mathbf{A} \mathbf{w}) = 2 \mathbf{A} \mathbf{w}\).

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2 \mathbf{S}_B \mathbf{w} - 2\lambda \mathbf{S}_W \mathbf{w} = 0 \]
\[ \mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} \]

Multiply both sides by \(\mathbf{S}_W^{-1}\) (assuming \(\mathbf{S}_W\) is invertible):

\[ \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} \]

This is an eigenvalue problem! \(\mathbf{w}\) is an eigenvector of \(\mathbf{S}_W^{-1} \mathbf{S}_B\) with eigenvalue \(\lambda\).

Simplification for two classes:

Recall \(\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\).

This is a rank-1 matrix (outer product of two vectors), so:

\[ \mathbf{S}_B \mathbf{w} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{w} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \underbrace{[(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{w}]}_{\text{scalar}} \]

Let \(c = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{w}\) (a scalar). Then:

\[ \mathbf{S}_B \mathbf{w} = c (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \]

Substituting into the eigenvalue equation:

\[ c (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = \lambda \mathbf{S}_W \mathbf{w} \]
\[ \mathbf{w} = \frac{c}{\lambda} \mathbf{S}_W^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \]

Since we only care about the direction, ignore the constant \(c/\lambda\):

\[ \mathbf{w} \propto \mathbf{S}_W^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \]

Result: The optimal discriminant direction is

\[ \mathbf{w}^* = \mathbf{S}_W^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \]

Interpretation:

  • If \(\mathbf{S}_W = \mathbf{I}\) (identity, meaning uncorrelated features with unit variance), then \(\mathbf{w} \propto (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\), the direction connecting the class means
  • In general, \(\mathbf{S}_W^{-1}\) accounts for correlations and unequal variances, rotating the direction to maximize separation
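As a quick numerical sketch of this closed form (the means and covariance below are illustrative values, not estimates from real data):

```python
import numpy as np

# Illustrative class means and shared covariance (assumed values)
mu1 = np.array([5.0, 3.4])
mu2 = np.array([6.0, 2.8])
Sigma = np.array([[0.30, 0.10],
                  [0.10, 0.20]])

# Optimal discriminant direction: w* = S_W^{-1} (mu1 - mu2).
# Solving the linear system S_W w = mu1 - mu2 avoids forming the inverse.
w = np.linalg.solve(Sigma, mu1 - mu2)
print(w)  # array([-5.2,  5.6])
```

Solving the linear system rather than explicitly inverting \(\mathbf{S}_W\) is numerically preferable, especially when \(p\) is large.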

Classification Rule

Once we have the discriminant direction \(\mathbf{w}\), how do we classify a new point \(\mathbf{x}\)?

Approach: Project \(\mathbf{x}\) onto \(\mathbf{w}\) and compare to a threshold.

Projection: \(z = \mathbf{w}^T \mathbf{x}\)

Decision rule: Assign \(\mathbf{x}\) to class 1 if

\[ z > \tau \]

where \(\tau\) is a threshold.

Choosing threshold: A natural choice is the midpoint between projected class means:

\[ \tau = \frac{\mathbf{w}^T \boldsymbol{\mu}_1 + \mathbf{w}^T \boldsymbol{\mu}_2}{2} = \frac{\mathbf{w}^T (\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)}{2} \]

Full classification rule:

\[ \text{Classify } \mathbf{x} \text{ as class 1 if } \mathbf{w}^T \mathbf{x} > \mathbf{w}^T \frac{(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)}{2} \]

Equivalently:

\[ \mathbf{w}^T \mathbf{x} - \mathbf{w}^T \frac{(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)}{2} > 0 \]

This is a linear decision boundary: The boundary is a hyperplane defined by \(\mathbf{w}^T \mathbf{x} = \text{const}\).
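Putting the direction and the midpoint threshold together gives the complete two-class rule. A minimal sketch, again with illustrative parameter values:

```python
import numpy as np

# Illustrative parameters (not fitted from real data)
mu1 = np.array([5.0, 3.4])
mu2 = np.array([6.0, 2.8])
Sigma = np.array([[0.30, 0.10],
                  [0.10, 0.20]])

w = np.linalg.solve(Sigma, mu1 - mu2)   # discriminant direction S_W^{-1}(mu1 - mu2)
tau = w @ (mu1 + mu2) / 2               # midpoint threshold

def classify(x):
    # Class 1 if the projection w^T x exceeds the threshold, else class 2
    return 1 if w @ x > tau else 2

print(classify(np.array([5.1, 3.5])))   # near mu1 -> 1
print(classify(np.array([6.1, 2.7])))   # near mu2 -> 2
```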

Connection to Bayes' Rule

LDA can also be derived from Bayes' rule — this gives additional insight and shows it's optimal under Gaussian assumptions.

Bayes' rule:

\[ P(Y = k | \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x} | Y = k) P(Y = k)}{P(\mathbf{X} = \mathbf{x})} \]

Posterior probability of class \(k\):

\[ P(Y = k | \mathbf{X} = \mathbf{x}) \propto P(\mathbf{X} = \mathbf{x} | Y = k) \cdot P(Y = k) \]

Assume: - \(\mathbf{X} | Y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})\) (Gaussian class-conditional densities) - \(P(Y = k) = \pi_k\) (prior probabilities, estimated from class frequencies)

Class-conditional density (multivariate Gaussian):

\[ f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right) \]

Log-posterior (ignoring constants independent of \(k\)):

\[ \log P(Y = k | \mathbf{X} = \mathbf{x}) = \log f_k(\mathbf{x}) + \log \pi_k + \text{const} \]
\[ = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) + \log \pi_k + \text{const} \]

Expand the quadratic:

\[ (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) = \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - 2 \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k \]

The term \(\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x}\) is independent of \(k\), so drop it:

\[ \log P(Y = k | \mathbf{X} = \mathbf{x}) = \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \log \pi_k + \text{const} \]

Define the discriminant function:

\[ \delta_k(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \log \pi_k \]

Classification rule: Assign \(\mathbf{x}\) to the class with the highest discriminant score:

\[ \hat{y} = \arg\max_{k} \delta_k(\mathbf{x}) \]

Observation: \(\delta_k(\mathbf{x})\) is linear in \(\mathbf{x}\) — hence "linear discriminant".

For two classes \((k \in \{1, 2\})\): Classify as class 1 if \(\delta_1(\mathbf{x}) > \delta_2(\mathbf{x})\):

\[ \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 - \frac{1}{2} \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \log \pi_1 > \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 - \frac{1}{2} \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 + \log \pi_2 \]

Rearranging:

\[ \mathbf{x}^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) > \frac{1}{2}(\boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2) + \log(\pi_2/\pi_1) \]

This matches the Fisher approach with:

  • \(\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\)
  • Threshold adjusted for the prior probabilities
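The discriminant functions \(\delta_k\) can be evaluated directly. The sketch below uses illustrative means, priors, and covariance; `predict` returns the class with the highest score:

```python
import numpy as np

# Illustrative parameters (assumed values, not fitted estimates)
mus = [np.array([5.0, 3.4]), np.array([6.0, 2.8])]
priors = [0.5, 0.5]
Sigma = np.array([[0.30, 0.10],
                  [0.10, 0.20]])
Sigma_inv = np.linalg.inv(Sigma)

def delta(x, mu, pi):
    # delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log(pi_k)
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

def predict(x):
    scores = [delta(x, mu, pi) for mu, pi in zip(mus, priors)]
    return int(np.argmax(scores)) + 1   # classes labelled 1..K

print(predict(np.array([5.1, 3.5])))   # near mu1 -> 1
print(predict(np.array([6.1, 2.7])))   # near mu2 -> 2
```

With equal priors this reproduces the midpoint-threshold rule; unequal priors shift the boundary toward the less probable class.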

Estimating Parameters from Data

In practice, we don't know \(\boldsymbol{\mu}_k\) and \(\boldsymbol{\Sigma}\) — we must estimate them from training data.

Notation: Class \(k\) has \(n_k\) observations \(\{\mathbf{x}_i : y_i = k\}\).

Estimating Class Means

Sample mean for class \(k\):

\[ \hat{\boldsymbol{\mu}}_k = \frac{1}{n_k} \sum_{i: y_i = k} \mathbf{x}_i \]

Interpretation: Simple average of all observations in class \(k\).

Estimating the Shared Covariance Matrix

Pooled covariance estimator:

\[ \hat{\boldsymbol{\Sigma}} = \frac{1}{n - K} \sum_{k=1}^K \sum_{i: y_i = k} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \]

Why \(n - K\) in the denominator? We lose \(K\) degrees of freedom by estimating the \(K\) class means (the multi-group analogue of Bessel's \(n - 1\) correction).

Alternative form (within-class scatter matrix):

\[ \mathbf{S}_W = \sum_{k=1}^K \sum_{i: y_i = k} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \]

Then: \(\hat{\boldsymbol{\Sigma}} = \mathbf{S}_W / (n - K)\)

Interpretation: Pool the sample covariances from all classes, weighting by class size.

Estimating Prior Probabilities

Empirical prior:

\[ \hat{\pi}_k = \frac{n_k}{n} \]

Interpretation: Proportion of training observations in class \(k\).

Alternative: If we have domain knowledge, we can use different priors (e.g., equal priors \(\pi_k = 1/K\)).
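The three estimators translate directly into code. A minimal sketch (the function name `fit_lda_params` is our own, not a library API):

```python
import numpy as np

def fit_lda_params(X, y):
    """Estimate class means, pooled covariance, and priors from labelled data."""
    classes = np.unique(y)
    n, K = len(y), len(classes)
    # Sample mean per class
    mus = {k: X[y == k].mean(axis=0) for k in classes}
    # Within-class scatter S_W, then hat(Sigma) = S_W / (n - K)
    S_W = sum((X[y == k] - mus[k]).T @ (X[y == k] - mus[k]) for k in classes)
    Sigma_hat = S_W / (n - K)
    # Empirical priors: class proportions
    priors = {k: np.mean(y == k) for k in classes}
    return mus, Sigma_hat, priors

# Tiny synthetic example
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
y = np.array([1, 1, 2, 2])
mus, Sigma_hat, priors = fit_lda_params(X, y)
```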

Multi-Class LDA: General Case

For \(K > 2\) classes, we want to find multiple discriminant directions that separate all classes.

Between-class scatter matrix (generalization):

\[ \mathbf{S}_B = \sum_{k=1}^K n_k (\hat{\boldsymbol{\mu}}_k - \hat{\boldsymbol{\mu}})(\hat{\boldsymbol{\mu}}_k - \hat{\boldsymbol{\mu}})^T \]

where \(\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\) is the overall mean.

Within-class scatter matrix (same as before):

\[ \mathbf{S}_W = \sum_{k=1}^K \sum_{i: y_i = k} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \]

Optimization: Find projection matrix \(\mathbf{W} \in \mathbb{R}^{p \times d}\) (columns are discriminant directions) that maximizes

\[ J(\mathbf{W}) = \frac{|\mathbf{W}^T \mathbf{S}_B \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_W \mathbf{W}|} \]

where \(|\cdot|\) denotes matrix determinant.

Solution: The discriminant directions are the eigenvectors corresponding to the largest eigenvalues of \(\mathbf{S}_W^{-1} \mathbf{S}_B\).

Number of discriminant directions: We can extract at most \(\min(p, K-1)\) discriminant directions, because:

  • \(\mathbf{S}_B\) has rank at most \(K - 1\) (it is constructed from the \(K\) class means centered at the overall mean)
  • We can't extract more directions than the feature dimension \(p\)

Dimensionality reduction: Project data onto \(d < p\) dimensions:

\[ \mathbf{z}_i = \mathbf{W}^T \mathbf{x}_i \]

where \(\mathbf{W} \in \mathbb{R}^{p \times d}\) contains the top \(d\) eigenvectors.

Classification in reduced space: Use discriminant functions in the reduced space, or simply use nearest class mean.
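The eigenvalue recipe can be sketched as follows; `lda_directions` is a hypothetical helper that assumes \(\mathbf{S}_W\) is invertible (i.e., \(n > p\)):

```python
import numpy as np

def lda_directions(X, y, d):
    """Top-d discriminant directions from S_W^{-1} S_B w = lambda w (a sketch)."""
    classes = np.unique(y)
    p = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        S_W += (Xk - mu_k).T @ (Xk - mu_k)             # within-class scatter
        diff = (mu_k - overall_mean)[:, None]
        S_B += len(Xk) * (diff @ diff.T)               # between-class scatter
    # Eigenvectors of S_W^{-1} S_B, sorted by eigenvalue (descending)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:d]].real                  # p x d projection matrix W

# Two well-separated synthetic classes: separation lies along the first axis
base = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X = np.vstack([base, base + np.array([10.0, 0.0])])
y = np.array([1, 1, 1, 1, 2, 2, 2, 2])
W = lda_directions(X, y, 1)
```

Since \(\mathbf{S}_W^{-1} \mathbf{S}_B\) is generally not symmetric, production code typically solves the equivalent generalized symmetric eigenproblem instead; the plain eigensolver is enough for illustration.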

Complete Worked Example

Problem: Classify iris flowers into 3 species using 2 features (simplified).

Data:

  • Class 1 (Setosa): 50 samples, \(\hat{\boldsymbol{\mu}}_1 = \begin{pmatrix} 5.0 \\ 3.4 \end{pmatrix}\)
  • Class 2 (Versicolor): 50 samples, \(\hat{\boldsymbol{\mu}}_2 = \begin{pmatrix} 6.0 \\ 2.8 \end{pmatrix}\)
  • Class 3 (Virginica): 50 samples, \(\hat{\boldsymbol{\mu}}_3 = \begin{pmatrix} 6.5 \\ 3.0 \end{pmatrix}\)

Pooled covariance (estimated):

\[ \hat{\boldsymbol{\Sigma}} = \begin{pmatrix} 0.30 & 0.10 \\ 0.10 & 0.20 \end{pmatrix} \]

Step 1: Compute inverse covariance.

\[ \hat{\boldsymbol{\Sigma}}^{-1} = \frac{1}{0.30 \times 0.20 - 0.10 \times 0.10} \begin{pmatrix} 0.20 & -0.10 \\ -0.10 & 0.30 \end{pmatrix} = \frac{1}{0.05} \begin{pmatrix} 0.20 & -0.10 \\ -0.10 & 0.30 \end{pmatrix} \]
\[ \hat{\boldsymbol{\Sigma}}^{-1} = \begin{pmatrix} 4 & -2 \\ -2 & 6 \end{pmatrix} \]

Step 2: Compute discriminant functions for each class (assuming equal priors \(\pi_k = 1/3\), so \(\log \pi_k\) drops out):

\[ \delta_k(\mathbf{x}) = \mathbf{x}^T \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\mu}}_k - \frac{1}{2} \hat{\boldsymbol{\mu}}_k^T \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\mu}}_k \]

For class 1:

\[ \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\mu}}_1 = \begin{pmatrix} 4 & -2 \\ -2 & 6 \end{pmatrix} \begin{pmatrix} 5.0 \\ 3.4 \end{pmatrix} = \begin{pmatrix} 20 - 6.8 \\ -10 + 20.4 \end{pmatrix} = \begin{pmatrix} 13.2 \\ 10.4 \end{pmatrix} \]
\[ \hat{\boldsymbol{\mu}}_1^T \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\mu}}_1 = 5.0 \times 13.2 + 3.4 \times 10.4 = 66 + 35.36 = 101.36 \]
\[ \delta_1(\mathbf{x}) = 13.2 x_1 + 10.4 x_2 - 50.68 \]

Similarly for classes 2 and 3 (calculations omitted for brevity).

Step 3: Classify a new observation \(\mathbf{x}_{\text{new}} = \begin{pmatrix} 5.5 \\ 3.0 \end{pmatrix}\).

Compute \(\delta_k(\mathbf{x}_{\text{new}})\) for \(k = 1, 2, 3\) and assign to the class with the highest value.

\[ \delta_1(\mathbf{x}_{\text{new}}) = 13.2 \times 5.5 + 10.4 \times 3.0 - 50.68 = 72.6 + 31.2 - 50.68 = 53.12 \]

(Compute \(\delta_2\) and \(\delta_3\) similarly and compare.)
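The remaining arithmetic is easy to check numerically, using the same means and \(\hat{\boldsymbol{\Sigma}}^{-1}\) from Step 1:

```python
import numpy as np

mus = [np.array([5.0, 3.4]),    # Setosa
       np.array([6.0, 2.8]),    # Versicolor
       np.array([6.5, 3.0])]    # Virginica
Sigma_inv = np.array([[4.0, -2.0],
                      [-2.0, 6.0]])

def delta(x, mu):
    # Equal priors (pi_k = 1/3), so the log-prior term is a shared constant
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu

x_new = np.array([5.5, 3.0])
scores = [delta(x_new, mu) for mu in mus]
predicted = int(np.argmax(scores)) + 1   # classes labelled 1..3
```

The scores come out to \(\delta_1 = 53.12\), \(\delta_2 = 53.68\), \(\delta_3 = 52.5\), so \(\mathbf{x}_{\text{new}}\) is assigned to class 2 (Versicolor).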

LDA vs. Other Methods

LDA vs. Logistic Regression

  • LDA: Generative model (models \(P(\mathbf{X}|Y)\) and uses Bayes' rule)
  • Logistic Regression: Discriminative model (directly models \(P(Y|\mathbf{X})\))

When LDA is better: If the Gaussian assumption holds, LDA is more statistically efficient (it uses all observations to estimate the shared covariance).

When logistic regression is better: It is more robust to violations of the Gaussian assumption.

LDA vs. QDA (Quadratic Discriminant Analysis)

  • LDA: Assumes shared covariance \(\boldsymbol{\Sigma}\)
  • QDA: Allows different covariances \(\boldsymbol{\Sigma}_k\) for each class

QDA discriminant:

\[ \delta_k(\mathbf{x}) = -\frac{1}{2}\log|\boldsymbol{\Sigma}_k| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) + \log \pi_k \]

This is quadratic in \(\mathbf{x}\) (because \(\boldsymbol{\Sigma}_k\) depends on \(k\)).

Trade-off:

  • LDA: lower variance (fewer parameters), but assumes a shared covariance
  • QDA: higher variance (more parameters), but more flexible

LDA vs. Naive Bayes

Naive Bayes: Assumes features are conditionally independent given class (\(\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \ldots, \sigma_p^2)\))

LDA: Allows correlations between features (full covariance matrix)

Assumptions and Limitations

Assumptions:

  1. Gaussian class-conditionals: \(\mathbf{X} | Y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})\)
  2. Shared covariance: all classes have the same \(\boldsymbol{\Sigma}\)
  3. Sufficient data: \(n > p\) (more observations than features)

What happens if assumptions are violated?

  • Non-Gaussian data: LDA can still work reasonably (discriminant is still linear), but may not be optimal
  • Different covariances: Use QDA instead
  • High-dimensional (\(p \approx n\) or \(p > n\)): Use regularized LDA (shrink \(\hat{\boldsymbol{\Sigma}}\) toward diagonal)

Limitations:

  • Linear boundaries: can't capture complex, nonlinear decision boundaries
  • Sensitivity to outliers: the sample mean and covariance are sensitive to outliers
  • Curse of dimensionality: covariance estimation becomes unreliable in high dimensions

Regularized LDA

When \(p\) is large relative to \(n\), \(\hat{\boldsymbol{\Sigma}}\) is poorly estimated or singular.

Solution: Regularize (shrink) the covariance estimate.

Shrinkage Estimator

\[ \hat{\boldsymbol{\Sigma}}_{\text{reg}} = (1 - \alpha) \hat{\boldsymbol{\Sigma}} + \alpha \text{diag}(\hat{\boldsymbol{\Sigma}}) \]

where \(0 \leq \alpha \leq 1\) controls the amount of shrinkage:

  • \(\alpha = 0\): standard LDA
  • \(\alpha = 1\): Naive Bayes (diagonal covariance)

Tuning \(\alpha\): Use cross-validation.

Ridge-Regularized LDA

Add a ridge penalty to ensure invertibility:

\[ \hat{\boldsymbol{\Sigma}}_{\text{ridge}} = \hat{\boldsymbol{\Sigma}} + \lambda \mathbf{I} \]

where \(\lambda > 0\) is a tuning parameter.
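Both regularizers are one-liners. A sketch (the function names are illustrative, not a library API):

```python
import numpy as np

def shrink_covariance(Sigma_hat, alpha):
    # (1 - alpha) * full estimate + alpha * its diagonal part
    return (1 - alpha) * Sigma_hat + alpha * np.diag(np.diag(Sigma_hat))

def ridge_covariance(Sigma_hat, lam):
    # Add lambda to every diagonal entry, guaranteeing invertibility
    return Sigma_hat + lam * np.eye(Sigma_hat.shape[0])

Sigma_hat = np.array([[0.30, 0.10],
                      [0.10, 0.20]])
print(shrink_covariance(Sigma_hat, 0.5))   # off-diagonals halved
print(ridge_covariance(Sigma_hat, 0.5))    # diagonal inflated by 0.5
```

In either case, the regularized estimate simply replaces \(\hat{\boldsymbol{\Sigma}}\) in the discriminant functions, with \(\alpha\) or \(\lambda\) chosen by cross-validation.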

Variables and Symbols

Symbol Name Description
\(\mathbf{x}_i\) Feature vector \(p\)-dimensional observation
\(y_i\) Class label Discrete class \(\in \{1, \ldots, K\}\)
\(K\) Number of classes Total classes
\(n_k\) Class size Number of observations in class \(k\)
\(\boldsymbol{\mu}_k\) Class mean Mean vector for class \(k\)
\(\boldsymbol{\Sigma}\) Covariance matrix Shared covariance across classes
\(\mathbf{w}\) Discriminant direction Projection vector
\(\mathbf{S}_B\) Between-class scatter Measures separation of class means
\(\mathbf{S}_W\) Within-class scatter Measures variation within classes
\(\delta_k(\mathbf{x})\) Discriminant function Linear score for class \(k\)
\(\pi_k\) Prior probability \(P(Y = k)\)

Historical and Modern References

  • Fisher, R. A. (1936). "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7(2), 179-188.
  • Rao, C. R. (1948). "The utilization of multiple measurements in problems of biological classification." Journal of the Royal Statistical Society. Series B, 10(2), 159-203.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 4]
  • Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern Classification (2nd ed.). Wiley.
  • McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley.