Central Limit Theorem

The Story Behind the Math

In 1733, Abraham de Moivre (1667-1754), a French Huguenot mathematician living in London, was studying a problem that would change statistics forever. Working as a consultant to gamblers and insurers, he noticed something peculiar while calculating probabilities for coin tosses.

De Moivre was trying to approximate binomial probabilities when the number of trials became large. Computing the terms of the expanded binomial \((a + b)^n\) for large \(n\) was tedious, and he needed shortcuts. Through brilliant insight, he discovered that the binomial distribution, when normalized, approached a smooth bell-shaped curve, the one we now call the normal distribution.

The mystery that puzzled everyone: Why should the sum of many random variables, regardless of their individual distributions, always tend toward this particular curve? De Moivre had found a specific case (binomial), but the general principle remained hidden.

It wasn't until Pierre-Simon Laplace (1749-1827) in 1812 that the theorem began to take shape. Laplace, analyzing errors in astronomical observations, noticed that measurement errors—no matter their source—tended to accumulate into a normal distribution. He provided a more general proof, but it still had limitations.

The complete, rigorous proof had to wait for Aleksandr Lyapunov (1857-1918) in 1901. Using the revolutionary tool of characteristic functions (Fourier transforms of probability distributions), Lyapunov finally showed why the theorem holds under very general conditions.

The modern irony: We call it the "normal" distribution as if it's ordinary, but its emergence from almost any random process is one of the most remarkable phenomena in mathematics. The bell curve isn't just common—it's inevitable when enough randomness accumulates.

Why It Matters

The Central Limit Theorem is the reason the normal distribution appears everywhere:

  • Physics: Measurement errors in any experiment tend to be normal
  • Biology: Heights, weights, and biological measurements cluster around averages
  • Quality control: Manufacturing variations aggregate to normal distributions
  • Finance: Price movements over time periods become approximately normal
  • Polling: Sample means converge to normal distributions
  • Machine Learning: Many algorithms assume normality for tractability
  • Statistical testing: Most tests (t-tests, ANOVA, regression) rely on the CLT

Without understanding why and how the CLT works, we couldn't justify using normal approximations, construct confidence intervals, or perform most statistical inference.

The Core Insight

The Statement

Let \(X_1, X_2, \ldots, X_n\) be independent and identically distributed (i.i.d.) random variables with:

  • Mean: \(\mu = E[X_i]\)
  • Variance: \(\sigma^2 = \text{Var}(X_i) < \infty\)

Define the sample mean:

\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\]

As \(n \to \infty\), the standardized sample mean converges to a standard normal:

\[\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \overset{d}{\longrightarrow} \mathcal{N}(0, 1)\]

Or equivalently, for large \(n\) the sample mean is approximately distributed as:

\[\bar{X}_n \overset{\text{approx}}{\sim} \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)\]

(This is written loosely: a literal limit would have vanishing variance, which is why the rigorous statement standardizes first.)

Key insight: The distribution of \(X_i\) doesn't matter (as long as it has finite variance)! Uniform, exponential, Bernoulli, anything—they all converge to normal when averaged.
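This insight is easy to check by simulation. A minimal sketch (assuming NumPy is available): draw heavily skewed Exponential(1) samples, average them, standardize with the known \(\mu\) and \(\sigma\), and verify the result behaves like a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardized_means(n, reps=20_000):
    """Standardized sample means of n Exponential(1) draws per replication."""
    mu, sigma = 1.0, 1.0  # Exponential(1) has mean 1 and variance 1
    samples = rng.exponential(1.0, size=(reps, n))
    return (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

z = standardized_means(n=200)
print(z.mean(), z.std())                # ~0.0 and ~1.0
print(np.mean(np.abs(z) < 1))           # ~0.683, as for N(0, 1)
```

Despite the exponential's strong right skew, the standardized means match the mean, spread, and central mass of \(\mathcal{N}(0, 1)\) already at \(n = 200\).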

The Complete Proof

Step 1: Characteristic Functions

The characteristic function of a random variable \(X\) is the Fourier transform of its distribution; when \(X\) has a density \(f_X\):

\[\varphi_X(t) = E[e^{itX}] = \int_{-\infty}^{\infty} e^{itx} f_X(x) \, dx\]

Why characteristic functions? They uniquely determine distributions, and—crucially—the characteristic function of a sum of independent variables is the product of their characteristic functions.
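The product property can be sanity-checked numerically. A sketch (assuming NumPy; `ecf` is an illustrative helper, not a library function): estimate \(E[e^{itX}]\) by Monte Carlo for two independent samples and for their sum.

```python
import numpy as np

rng = np.random.default_rng(1)

def ecf(samples, t):
    """Empirical characteristic function: Monte Carlo estimate of E[exp(itX)]."""
    return np.mean(np.exp(1j * t * samples))

x = rng.uniform(0, 1, 200_000)       # independent of y
y = rng.exponential(1.0, 200_000)

t = 0.7
lhs = ecf(x + y, t)                  # characteristic function of the sum
rhs = ecf(x, t) * ecf(y, t)          # product of the individual ones
print(abs(lhs - rhs))                # small: agreement up to Monte Carlo error
```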

Step 2: Expansion of the Characteristic Function

For small \(t\), we can expand \(\varphi_X(t)\) using Taylor series around \(t = 0\):

\[\varphi_X(t) = E\left[1 + itX + \frac{(itX)^2}{2!} + \cdots\right]\]

Using linearity of expectation:

\[\varphi_X(t) = 1 + itE[X] - \frac{t^2}{2}E[X^2] + \cdots\]

Let \(\mu = E[X]\) and \(\sigma^2 = \text{Var}(X) = E[X^2] - \mu^2\). Then:

\[\varphi_X(t) = 1 + it\mu - \frac{t^2}{2}(\sigma^2 + \mu^2) + O(t^3)\]

Step 3: Centering and Scaling

Consider the standardized sum:

\[Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n (X_i - \mu)\]

Let \(Y_i = X_i - \mu\). Then \(E[Y_i] = 0\) and \(\text{Var}(Y_i) = \sigma^2\).

The characteristic function of \(Y_i\):

\[\varphi_Y(t) = 1 - \frac{t^2\sigma^2}{2} + O(t^3)\]

(No linear term because \(E[Y] = 0\))
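This quadratic approximation can be verified directly (a sketch, assuming NumPy): compare the empirical characteristic function of centered exponential draws against \(1 - t^2\sigma^2/2\) for small \(t\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Centered Exponential with mean 2: E[Y] = 0, Var(Y) = 4
y = rng.exponential(2.0, 500_000) - 2.0
sigma2 = 4.0

for t in (0.01, 0.05, 0.1):
    empirical = np.mean(np.exp(1j * t * y)).real
    quadratic = 1 - t**2 * sigma2 / 2
    # The gap is O(t^3), so it shrinks much faster than t itself
    print(t, empirical, quadratic)
```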

Step 4: Characteristic Function of the Sum

Since \(Y_i\) are independent:

\[\varphi_{\sum Y_i}(t) = \prod_{i=1}^n \varphi_{Y_i}(t) = \left(\varphi_Y(t)\right)^n\]

For the standardized variable \(Z_n = \frac{\sum Y_i}{\sigma\sqrt{n}}\):

\[\varphi_{Z_n}(t) = \varphi_{\sum Y_i}\left(\frac{t}{\sigma\sqrt{n}}\right) = \left[\varphi_Y\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n\]

Step 5: The Limit

Substitute the expansion:

\[\varphi_{Z_n}(t) = \left[1 - \frac{t^2}{2n} + O(n^{-3/2})\right]^n\]

As \(n \to \infty\), this converges to:

\[\varphi_{Z_n}(t) \longrightarrow e^{-t^2/2}\]

Why? By the limit \((1 + x/n)^n \to e^x\) with \(x = -t^2/2\); the \(O(n^{-3/2})\) remainder is too small to affect the limit.
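The limit itself is easy to check numerically with nothing beyond the standard library (here with \(t = 1\), so the target is \(e^{-1/2}\)):

```python
import math

t = 1.0
target = math.exp(-t**2 / 2)
for n in (10, 100, 10_000):
    # (1 - t^2/(2n))^n should approach e^{-t^2/2} as n grows
    print(n, (1 - t**2 / (2 * n))**n)
print(target)  # 0.6065306597126334
```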

Step 6: Recognize the Target

The characteristic function \(e^{-t^2/2}\) is exactly the characteristic function of the standard normal distribution \(\mathcal{N}(0, 1)\)!

Since characteristic functions uniquely determine distributions, and pointwise convergence of characteristic functions implies convergence in distribution (Lévy's continuity theorem):

\[Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \overset{d}{\longrightarrow} \mathcal{N}(0, 1)\]

Q.E.D. The Central Limit Theorem is proved.

Understanding the Structure

Why \(\sqrt{n}\)?

  • The sum grows as \(n\) (linearly)
  • But its standard deviation grows only as \(\sqrt{n}\) (square root)
  • So we must scale by \(\sqrt{n}\) to get a stable, non-degenerate distribution
  • This is the mathematical reason behind the "square root law" of statistics
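A small simulation (assuming NumPy) illustrates the square root law: quadrupling \(n\) only doubles the standard deviation of the sum.

```python
import numpy as np

rng = np.random.default_rng(3)

def sd_of_sum(n, reps=50_000):
    """Simulated standard deviation of the sum of n Uniform(0,1) draws."""
    return rng.uniform(0, 1, size=(reps, n)).sum(axis=1).std()

# Uniform(0,1) has sigma = 1/sqrt(12) ~ 0.289, so the sum's SD is ~0.289*sqrt(n)
for n in (1, 4, 16, 64):
    print(n, sd_of_sum(n))  # roughly doubles each time n quadruples
```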

Why does the specific distribution of \(X_i\) not matter?

  • Only the first two moments (mean and variance) survive in the limit
  • The contributions of higher moments vanish as \(n \to \infty\)
  • The characteristic function expansion shows all distributions "look similar" near zero

Why is the result always normal?

  • The normal distribution is the fixed point of the averaging operation
  • The average of independent normals is again normal
  • The CLT says: all finite-variance distributions flow to this fixed point under repeated averaging

Visual Demonstration

Convergence in Action

| Original Distribution | n = 5 | n = 30 | n = 100 |
| --- | --- | --- | --- |
| Uniform(0,1) | Rough bell | Smooth bell | Very normal |
| Exponential(λ) | Right skew | Symmetric | Normal |
| Bernoulli(0.5) | Discrete | Approaching continuous | Normal |
| Bimodal mix | Four peaks | Merged | Single peak |

Key observation: Even highly non-normal distributions (exponential, bimodal) converge to the bell curve when averaged.
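This convergence can be quantified (a sketch assuming NumPy): the skewness of the sample-mean distribution for Exponential(1) data is \(2/\sqrt{n}\) in theory, so it shrinks steadily toward the symmetric bell shape as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(4)

def mean_skewness(n, reps=50_000):
    """Simulated skewness of the sample mean of n Exponential(1) draws."""
    means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    z = (means - means.mean()) / means.std()
    return float(np.mean(z**3))

# Theory predicts 2/sqrt(n): ~0.89, ~0.37, ~0.20
for n in (5, 30, 100):
    print(n, mean_skewness(n))
```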

Practical Implications

When Can You Use the Normal Approximation?

Rule of thumb: \(n \geq 30\) often works well, but the required size depends on:

  • How non-normal the original distribution is
  • How far into the tails you need accuracy
  • Whether you're looking at individual observations or extremes

Better guidelines:

  • Symmetric distributions: \(n \geq 10\)-\(20\)
  • Moderate skew: \(n \geq 30\)-\(50\)
  • Heavy tails: \(n \geq 100\) or more
  • Extreme quantiles: always need larger \(n\)

Confidence Intervals via CLT

The CLT justifies the standard confidence interval formula:

\[\bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\]

Even when the population isn't normal, the sampling distribution of \(\bar{X}\) is approximately normal for large \(n\), so the interval (with \(s\) the sample standard deviation) attains roughly its nominal coverage.
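A coverage simulation makes this concrete (a sketch assuming NumPy, with the 95% critical value \(z_{\alpha/2} = 1.96\)): even for a skewed exponential population, the interval covers the true mean close to 95% of the time at \(n = 100\).

```python
import numpy as np

rng = np.random.default_rng(5)

def coverage(n, reps=20_000, z=1.96, true_mean=1.0):
    """Fraction of CLT-based 95% intervals that cover the Exponential(1) mean."""
    samples = rng.exponential(1.0, size=(reps, n))
    means = samples.mean(axis=1)
    halves = z * samples.std(axis=1, ddof=1) / np.sqrt(n)
    return float(np.mean(np.abs(means - true_mean) <= halves))

print(coverage(n=100))  # close to 0.95 (slightly low, since the population is skewed)
```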

The CLT in Modern Statistics

  • Bootstrap methods: Resampling exploits the CLT implicitly
  • Machine Learning: Many algorithms assume normality of errors
  • Hypothesis testing: Most tests assume normality via CLT
  • Bayesian inference: Posterior distributions often become normal (Bernstein-von Mises theorem)

Common Misconceptions

  1. "The data becomes normal": No! The sample mean becomes normal. Individual data points keep their original distribution.

  2. "It works for any sum": The sum itself diverges (grows without bound). You must standardize (subtract mean, divide by standard deviation) to get convergence.

  3. "All averages are normally distributed": Only approximately, and only for large \(n\). Small samples may be far from normal.

  4. "CLT applies to any statistic": No! Only to sums and means (and statistics asymptotically equivalent to them). Sample maxima, for instance, converge to extreme-value distributions, not the normal; the sample median is asymptotically normal, but that follows from a separate argument, not the CLT itself.

  5. "CLT works for dependent data": Not necessarily! The i.i.d. assumption is crucial. Correlated data requires more sophisticated versions (e.g., mixing conditions).
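One more failure mode worth demonstrating (a sketch assuming NumPy): the finite-variance condition in the statement is not decorative. The mean of \(n\) standard Cauchy draws is itself standard Cauchy for every \(n\), so averaging never tames it.

```python
import numpy as np

rng = np.random.default_rng(6)

# The interquartile range of a standard Cauchy is 2. If the CLT applied,
# the IQR of the sample mean would shrink like 1/sqrt(n); instead it
# stays near 2 because the Cauchy has no finite variance (or mean).
for n in (10, 100, 10_000):
    means = rng.standard_cauchy(size=(2_000, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    print(n, q75 - q25)  # stays near 2 instead of shrinking
```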

References

  • de Moivre, A. (1733). "Approximatio ad Summam Terminorum Binomii \((a+b)^n\) in Seriem Expansi." Supplementum II to Miscellanea Analytica.
  • Laplace, P. S. (1812). Théorie Analytique des Probabilités. Paris: Courcier.
  • Lyapunov, A. M. (1901). "Nouvelle Forme du Théorème sur la Limite de Probabilité." Mémoires de l'Académie Impériale des Sciences de St. Pétersbourg.
  • Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley.
  • Billingsley, P. (1995). Probability and Measure (3rd ed.). Wiley.