Central Limit Theorem

The Story Behind the Math

In 1733, Abraham de Moivre (1667-1754), a French Huguenot mathematician living in London, was studying a problem that would change statistics forever. Working as a consultant to gamblers and insurers, he noticed something peculiar while calculating probabilities for coin tosses.

De Moivre was trying to approximate binomial probabilities when the number of trials became large. Computing the terms of the expanded binomial \((a + b)^n\) for large \(n\) was tedious, and he needed shortcuts. Through brilliant insight, he discovered that the binomial distribution, when normalized, approached a smooth bell-shaped curve, the one we now call the normal distribution.

The mystery that puzzled everyone: Why should the sum of many random variables, regardless of their individual distributions, always tend toward this particular curve? De Moivre had found a specific case (binomial), but the general principle remained hidden.

It wasn't until Pierre-Simon Laplace (1749-1827) in 1812 that the theorem began to take shape. Laplace, analyzing errors in astronomical observations, noticed that measurement errors—no matter their source—tended to accumulate into a normal distribution. He provided a more general proof, but it still had limitations.

The complete, rigorous proof had to wait for Aleksandr Lyapunov (1857-1918) in 1901. Using the revolutionary tool of characteristic functions (Fourier transforms of probability distributions), Lyapunov finally showed why the theorem holds under very general conditions.

The modern irony: We call it the "normal" distribution as if it's ordinary, but its emergence from almost any random process is one of the most remarkable phenomena in mathematics. The bell curve isn't just common—it's inevitable when enough randomness accumulates.

Why It Matters

The Central Limit Theorem is the reason the normal distribution appears everywhere:

  • Physics: Measurement errors in any experiment tend to be normal
  • Biology: Heights, weights, and biological measurements cluster around averages
  • Quality control: Manufacturing variations aggregate to normal distributions
  • Finance: Price movements over time periods become approximately normal
  • Polling: Sample means converge to normal distributions
  • Machine Learning: Many algorithms assume normality for tractability
  • Statistical testing: Most tests (t-tests, ANOVA, regression) rely on the CLT

Without understanding why and how the CLT works, we couldn't justify using normal approximations, construct confidence intervals, or perform most statistical inference.

The Core Insight

The Statement

Let \(X_1, X_2, \ldots, X_n\) be independent and identically distributed (i.i.d.) random variables with:

  • Mean: \(\mu = E[X_i]\)
  • Variance: \(\sigma^2 = \text{Var}(X_i) < \infty\)

Define the sample mean:

\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\]

As \(n \to \infty\), the standardized sample mean converges to a standard normal:

\[\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \overset{d}{\longrightarrow} \mathcal{N}(0, 1)\]

Or equivalently, for large \(n\) the sample mean is approximately distributed as:

\[\bar{X}_n \overset{\text{approx}}{\sim} \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)\]

(This is written loosely: a literal limit would have vanishing variance, which is why the rigorous statement standardizes first.)

Key insight: The distribution of \(X_i\) doesn't matter (as long as it has finite variance)! Uniform, exponential, Bernoulli, anything—they all converge to normal when averaged.
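This insight is easy to check by simulation. A minimal sketch (assuming NumPy is available): draw heavily skewed Exponential(1) samples, average them, standardize with the known \(\mu\) and \(\sigma\), and verify the result behaves like a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardized_means(n, reps=20_000):
    """Standardized sample means of n Exponential(1) draws per replication."""
    mu, sigma = 1.0, 1.0  # Exponential(1) has mean 1 and variance 1
    samples = rng.exponential(1.0, size=(reps, n))
    return (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

z = standardized_means(n=200)
print(z.mean(), z.std())                # ~0.0 and ~1.0
print(np.mean(np.abs(z) < 1))           # ~0.683, as for N(0, 1)
```

Despite the exponential's strong right skew, the standardized means match the mean, spread, and central mass of \(\mathcal{N}(0, 1)\) already at \(n = 200\).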

The Complete Proof

Step 1: Characteristic Functions

The characteristic function of a random variable \(X\) is the Fourier transform of its distribution; when \(X\) has a density \(f_X\):

\[\varphi_X(t) = E[e^{itX}] = \int_{-\infty}^{\infty} e^{itx} f_X(x) \, dx\]

Why characteristic functions? They uniquely determine distributions, and—crucially—the characteristic function of a sum of independent variables is the product of their characteristic functions.
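The product property can be sanity-checked numerically. A sketch (assuming NumPy; `ecf` is an illustrative helper, not a library function): estimate \(E[e^{itX}]\) by Monte Carlo for two independent samples and for their sum.

```python
import numpy as np

rng = np.random.default_rng(1)

def ecf(samples, t):
    """Empirical characteristic function: Monte Carlo estimate of E[exp(itX)]."""
    return np.mean(np.exp(1j * t * samples))

x = rng.uniform(0, 1, 200_000)       # independent of y
y = rng.exponential(1.0, 200_000)

t = 0.7
lhs = ecf(x + y, t)                  # characteristic function of the sum
rhs = ecf(x, t) * ecf(y, t)          # product of the individual ones
print(abs(lhs - rhs))                # small: agreement up to Monte Carlo error
```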

Step 2: Expansion of the Characteristic Function

For small \(t\), we can expand \(\varphi_X(t)\) using Taylor series around \(t = 0\):

\[\varphi_X(t) = E\left[1 + itX + \frac{(itX)^2}{2!} + \cdots\right]\]

Using linearity of expectation:

\[\varphi_X(t) = 1 + itE[X] - \frac{t^2}{2}E[X^2] + \cdots\]

Let \(\mu = E[X]\) and \(\sigma^2 = \text{Var}(X) = E[X^2] - \mu^2\). Then:

\[\varphi_X(t) = 1 + it\mu - \frac{t^2}{2}(\sigma^2 + \mu^2) + O(t^3)\]

Step 3: Centering and Scaling

Consider the standardized sum:

\[Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n (X_i - \mu)\]

Let \(Y_i = X_i - \mu\). Then \(E[Y_i] = 0\) and \(\text{Var}(Y_i) = \sigma^2\).

The characteristic function of \(Y_i\):

\[\varphi_Y(t) = 1 - \frac{t^2\sigma^2}{2} + O(t^3)\]

(No linear term because \(E[Y] = 0\))
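This quadratic approximation can be verified directly (a sketch, assuming NumPy): compare the empirical characteristic function of centered exponential draws against \(1 - t^2\sigma^2/2\) for small \(t\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Centered Exponential with mean 2: E[Y] = 0, Var(Y) = 4
y = rng.exponential(2.0, 500_000) - 2.0
sigma2 = 4.0

for t in (0.01, 0.05, 0.1):
    empirical = np.mean(np.exp(1j * t * y)).real
    quadratic = 1 - t**2 * sigma2 / 2
    # The gap is O(t^3), so it shrinks much faster than t itself
    print(t, empirical, quadratic)
```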

Step 4: Characteristic Function of the Sum

Since \(Y_i\) are independent:

\[\varphi_{\sum Y_i}(t) = \prod_{i=1}^n \varphi_{Y_i}(t) = \left(\varphi_Y(t)\right)^n\]

For the standardized variable \(Z_n = \frac{\sum Y_i}{\sigma\sqrt{n}}\):

\[\varphi_{Z_n}(t) = \varphi_{\sum Y_i}\left(\frac{t}{\sigma\sqrt{n}}\right) = \left[\varphi_Y\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n\]

Step 5: The Limit

Substitute the expansion:

\[\varphi_{Z_n}(t) = \left[1 - \frac{t^2}{2n} + O(n^{-3/2})\right]^n\]

As \(n \to \infty\), this converges to:

\[\varphi_{Z_n}(t) \longrightarrow e^{-t^2/2}\]

Why? By the limit \((1 + x/n)^n \to e^x\) with \(x = -t^2/2\); the \(O(n^{-3/2})\) remainder is too small to affect the limit.
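The limit itself is easy to check numerically with nothing beyond the standard library (here with \(t = 1\), so the target is \(e^{-1/2}\)):

```python
import math

t = 1.0
target = math.exp(-t**2 / 2)
for n in (10, 100, 10_000):
    # (1 - t^2/(2n))^n should approach e^{-t^2/2} as n grows
    print(n, (1 - t**2 / (2 * n))**n)
print(target)  # 0.6065306597126334
```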

Step 6: Recognize the Target

The characteristic function \(e^{-t^2/2}\) is exactly the characteristic function of the standard normal distribution \(\mathcal{N}(0, 1)\)!

Since characteristic functions uniquely determine distributions, and pointwise convergence of characteristic functions implies convergence in distribution (Lévy's continuity theorem):

\[Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \overset{d}{\longrightarrow} \mathcal{N}(0, 1)\]

Q.E.D. The Central Limit Theorem is proved.

Understanding the Structure

Why \(\sqrt{n}\)?

  • The sum grows as \(n\) (linearly)
  • But its standard deviation grows only as \(\sqrt{n}\) (square root)
  • So we must scale by \(\sqrt{n}\) to get a stable, non-degenerate distribution
  • This is the mathematical reason behind the "square root law" of statistics
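A small simulation (assuming NumPy) illustrates the square root law: quadrupling \(n\) only doubles the standard deviation of the sum.

```python
import numpy as np

rng = np.random.default_rng(3)

def sd_of_sum(n, reps=50_000):
    """Simulated standard deviation of the sum of n Uniform(0,1) draws."""
    return rng.uniform(0, 1, size=(reps, n)).sum(axis=1).std()

# Uniform(0,1) has sigma = 1/sqrt(12) ~ 0.289, so the sum's SD is ~0.289*sqrt(n)
for n in (1, 4, 16, 64):
    print(n, sd_of_sum(n))  # roughly doubles each time n quadruples
```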

Why does the specific distribution of \(X_i\) not matter?

  • Only the first two moments (mean and variance) survive in the limit
  • The contributions of higher moments vanish as \(n \to \infty\)
  • The characteristic function expansion shows all distributions "look similar" near zero

Why is the result always normal?

  • The normal distribution is the fixed point of the averaging operation
  • The average of independent normals is again normal
  • The CLT says: all finite-variance distributions flow to this fixed point under repeated averaging

Visual Demonstration

Convergence in Action

| Original Distribution | n = 5 | n = 30 | n = 100 |
| --- | --- | --- | --- |
| Uniform(0,1) | Rough bell | Smooth bell | Very normal |
| Exponential(λ) | Right skew | Symmetric | Normal |
| Bernoulli(0.5) | Discrete | Approaching continuous | Normal |
| Bimodal mix | Four peaks | Merged | Single peak |

Key observation: Even highly non-normal distributions (exponential, bimodal) converge to the bell curve when averaged.
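This convergence can be quantified (a sketch assuming NumPy): the skewness of the sample-mean distribution for Exponential(1) data is \(2/\sqrt{n}\) in theory, so it shrinks steadily toward the symmetric bell shape as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(4)

def mean_skewness(n, reps=50_000):
    """Simulated skewness of the sample mean of n Exponential(1) draws."""
    means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    z = (means - means.mean()) / means.std()
    return float(np.mean(z**3))

# Theory predicts 2/sqrt(n): ~0.89, ~0.37, ~0.20
for n in (5, 30, 100):
    print(n, mean_skewness(n))
```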

Practical Implications

When Can You Use the Normal Approximation?

Rule of thumb: \(n \geq 30\) often works well, but the required size depends on:

  • How non-normal the original distribution is
  • How far into the tails you need accuracy
  • Whether you're looking at individual observations or extremes

Better guidelines:

  • Symmetric distributions: \(n \geq 10\)-\(20\)
  • Moderate skew: \(n \geq 30\)-\(50\)
  • Heavy tails: \(n \geq 100\) or more
  • Extreme quantiles: always need larger \(n\)

Confidence Intervals via CLT

The CLT justifies the standard confidence interval formula:

\[\bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\]

Even when the population isn't normal, the sampling distribution of \(\bar{X}\) is approximately normal for large \(n\), so the interval (with \(s\) the sample standard deviation) attains roughly its nominal coverage.
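A coverage simulation makes this concrete (a sketch assuming NumPy, with the 95% critical value \(z_{\alpha/2} = 1.96\)): even for a skewed exponential population, the interval covers the true mean close to 95% of the time at \(n = 100\).

```python
import numpy as np

rng = np.random.default_rng(5)

def coverage(n, reps=20_000, z=1.96, true_mean=1.0):
    """Fraction of CLT-based 95% intervals that cover the Exponential(1) mean."""
    samples = rng.exponential(1.0, size=(reps, n))
    means = samples.mean(axis=1)
    halves = z * samples.std(axis=1, ddof=1) / np.sqrt(n)
    return float(np.mean(np.abs(means - true_mean) <= halves))

print(coverage(n=100))  # close to 0.95 (slightly low, since the population is skewed)
```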

The CLT in Modern Statistics

  • Bootstrap methods: Resampling exploits the CLT implicitly
  • Machine Learning: Many algorithms assume normality of errors
  • Hypothesis testing: Most tests assume normality via CLT
  • Bayesian inference: Posterior distributions often become normal (Bernstein-von Mises theorem)

Common Misconceptions

  1. "The data becomes normal": No! The sample mean becomes normal. Individual data points keep their original distribution.

  2. "It works for any sum": The sum itself diverges (grows without bound). You must standardize (subtract mean, divide by standard deviation) to get convergence.

  3. "All averages are normally distributed": Only approximately, and only for large \(n\). Small samples may be far from normal.

  4. "CLT applies to any statistic": No! Only to sums and means (and statistics asymptotically equivalent to them). Sample maxima, for instance, converge to extreme-value distributions, not the normal; the sample median is asymptotically normal, but that follows from a separate argument, not the CLT itself.

  5. "CLT works for dependent data": Not necessarily! The i.i.d. assumption is crucial. Correlated data requires more sophisticated versions (e.g., mixing conditions).
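One more failure mode worth demonstrating (a sketch assuming NumPy): the finite-variance condition in the statement is not decorative. The mean of \(n\) standard Cauchy draws is itself standard Cauchy for every \(n\), so averaging never tames it.

```python
import numpy as np

rng = np.random.default_rng(6)

# The interquartile range of a standard Cauchy is 2. If the CLT applied,
# the IQR of the sample mean would shrink like 1/sqrt(n); instead it
# stays near 2 because the Cauchy has no finite variance (or mean).
for n in (10, 100, 10_000):
    means = rng.standard_cauchy(size=(2_000, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    print(n, q75 - q25)  # stays near 2 instead of shrinking
```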

References

  • de Moivre, A. (1733). "Approximatio ad Summam Terminorum Binomii \((a+b)^n\) in Seriem Expansi." Supplementum II to Miscellanea Analytica.
  • Laplace, P. S. (1812). Théorie Analytique des Probabilités. Paris: Courcier.
  • Lyapunov, A. M. (1901). "Nouvelle Forme du Théorème sur la Limite de Probabilité." Mémoires de l'Académie Impériale des Sciences de St. Pétersbourg.
  • Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley.
  • Billingsley, P. (1995). Probability and Measure (3rd ed.). Wiley.