Gaussian Distribution (Normal Distribution) — PDF

The Formula

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

What It Means

This formula answers a deceptively simple question: if you know that data clusters around some central value and trails off on both sides, how likely is any particular value?

Plug in a value of \(x\) and this function tells you how "dense" the probability is at that point. It draws the famous bell curve — tall in the middle at \(\mu\), fading symmetrically as you move away, the speed of that fade controlled by \(\sigma\).
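Plugging values in is straightforward. A minimal sketch in Python (the function name `normal_pdf` is ours, not a standard API):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std dev sigma at x."""
    z = (x - mu) / sigma  # distance from the center, in standard deviations
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Tall in the middle, fading symmetrically:
print(normal_pdf(0.0))   # peak of the standard normal, 1/sqrt(2*pi) ≈ 0.3989
print(normal_pdf(2.0))   # two sigma out: much smaller
```

SciPy users would typically reach for `scipy.stats.norm.pdf`, which computes the same quantity.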

Heights of people, measurement errors, exam scores, thermal noise in electronics — an absurd number of things in nature follow this shape. Not by coincidence, but for a deep mathematical reason we'll get to.

Why It Works — The Story Behind the Formula

How a Bell Curve Was Born from a Coin Toss

The year is 1733. Abraham de Moivre, a French mathematician exiled in London, is working on a gambling problem. He wants to know: if you flip a fair coin 100 times, what's the probability of getting exactly 50 heads?

The binomial formula gives the exact answer, but it involves factorials so large they're basically impossible to compute by hand. De Moivre needs an approximation. After months of work, he discovers that as the number of coin flips grows, the binomial distribution starts to look like a smooth, symmetric curve — and he writes down the formula for that curve.

He had just discovered the normal distribution. But he saw it as a calculation trick, not a law of nature.

Gauss and the Theory of Errors

Fast forward to 1809. Carl Friedrich Gauss — arguably the greatest mathematician who ever lived — is working on a completely different problem: tracking the orbits of asteroids. Every astronomical observation has small errors, and Gauss wants to find the "true" value from noisy data.

He asks himself: what distribution of errors would make the arithmetic mean the best possible estimate? Working backwards from this requirement, he derives the bell curve. The same formula de Moivre found from coin flips, Gauss finds from the logic of measurement errors.

This is why it's called the "Gaussian" distribution — though de Moivre arguably got there first. (Stigler's Law of Eponymy strikes again: no scientific discovery is named after its actual discoverer.)

The Central Limit Theorem — Why It's Everywhere

But the real reason this distribution is so important came from Pierre-Simon Laplace, who proved something astonishing:

If you add up many small, independent random effects, the result is approximately normally distributed — no matter what the individual effects look like, provided each has finite variance.

This is the Central Limit Theorem, and it's the reason the bell curve shows up everywhere:

  • Your height is the sum of many genetic and environmental factors
  • A measurement error is the sum of many tiny instrument imperfections
  • Exam scores reflect the accumulation of many small knowledge gaps or successes
  • Thermal noise in a wire is the sum of billions of random electron movements

The Gaussian distribution isn't some arbitrary shape that happens to fit data. It's the inevitable shape that emerges whenever many small independent things add up. It's the mathematical fingerprint of accumulated randomness.

Dissecting the Formula — Piece by Piece

Now let's understand why the formula looks the way it does.

The Exponential: \(e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\)

This is the heart of the bell curve. Let's unpack it from the inside out.

The \((x - \mu)\) part measures how far you are from the center. At \(x = \mu\), this is zero — you're at the peak.

The \(\frac{x-\mu}{\sigma}\) part normalizes that distance by the spread. If \(\sigma\) is large, being 10 units away from \(\mu\) isn't a big deal. If \(\sigma\) is small, it's huge. This ratio is called the z-score — it measures distance in units of standard deviations.

The squaring makes the function symmetric: \(x\) values equally far above and below \(\mu\) get the same probability. It also ensures the tails go to zero — the farther you go, the faster it drops.

The \(-\frac{1}{2}\) is a convention that makes the variance come out to exactly \(\sigma^2\). Gauss originally used \(e^{-x^2}\) and the factor emerged from his analysis. If we used \(e^{-\left(\frac{x-\mu}{\sigma}\right)^2}\) without the \(\frac{1}{2}\), the math would still work, but the variance would be \(\frac{\sigma^2}{2}\) instead of \(\sigma^2\). The half is there so that \(\sigma\) means exactly what we want it to mean.
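You can verify the effect of the \(\frac{1}{2}\) numerically. The sketch below (the helper name and grid parameters are ours) integrates both versions of the exponential with a midpoint rule and computes the variance each one implies:

```python
import math

def implied_variance(scale, sigma=2.0, lim=30.0, n=200_000):
    """Variance of the density proportional to exp(-scale * (t/sigma)**2)."""
    dt = 2 * lim / n
    norm = var = 0.0
    for i in range(n):
        t = -lim + (i + 0.5) * dt              # midpoint rule
        w = math.exp(-scale * (t / sigma) ** 2)
        norm += w * dt                          # total area (for normalizing)
        var += t * t * w * dt                   # second moment about 0
    return var / norm

with_half = implied_variance(0.5)      # exponent -(1/2)(t/sigma)^2
without_half = implied_variance(1.0)   # exponent -(t/sigma)^2
print(with_half)      # ≈ 4.0 = sigma^2
print(without_half)   # ≈ 2.0 = sigma^2 / 2
```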

The \(e^{(\cdots)}\) ensures the function is always positive (probabilities can't be negative) and gives us that smooth, infinitely-tailed decay. Why \(e\) specifically? Because the Gaussian \(e^{-x^2/2}\) is, up to scaling, its own Fourier transform — a property that ties it deeply to the structure of mathematics itself.

The Normalizing Constant: \(\frac{1}{\sigma\sqrt{2\pi}}\)

A probability density function must integrate to 1 over all possible values (total probability = 100%). The raw exponential \(e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\) doesn't do that on its own — its integral over \((-\infty, \infty)\) equals \(\sigma\sqrt{2\pi}\).

So we divide by \(\sigma\sqrt{2\pi}\) to fix it. That's all this factor does — it's the bookkeeping that makes the total area equal 1.

But why \(\sqrt{2\pi}\)? This comes from one of the most beautiful results in mathematics: the Gaussian integral.

\[ \int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi} \]

This result is usually credited to Laplace, and the gorgeous trick for proving it to Poisson: square the integral, convert to polar coordinates, and the problem collapses into a simple integral over a circle. The \(\pi\) appears because a circle is hiding inside the bell curve. (The factor becomes \(\sqrt{2\pi}\) instead of \(\sqrt{\pi}\) because of our \(\frac{1}{2}\) in the exponent.)

There's something almost poetic about it: the most important probability distribution in all of statistics secretly contains a circle.
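Both numbers are easy to check numerically. A midpoint-rule sketch (the helper name and step counts are ours):

```python
import math

def integrate(f, lim, n=200_000):
    """Midpoint-rule approximation of the integral of f over [-lim, lim]."""
    dx = 2 * lim / n
    return sum(f(-lim + (i + 0.5) * dx) * dx for i in range(n))

# The Gaussian integral: should match sqrt(pi) ≈ 1.7725
gauss = integrate(lambda x: math.exp(-x * x), lim=10.0)
print(gauss, math.sqrt(math.pi))

# The un-normalized bell curve: area should match sigma*sqrt(2*pi) ≈ 7.5199
sigma = 3.0
area = integrate(lambda x: math.exp(-0.5 * (x / sigma) ** 2), lim=40.0)
print(area, sigma * math.sqrt(2 * math.pi))
```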

Step-by-Step Derivation

Gauss's Approach (From Maximum Likelihood)

Gauss asked: if I have \(n\) measurements \(x_1, \ldots, x_n\) with errors following some unknown density \(f\), and I want the sample mean to be the maximum likelihood estimator, what must \(f\) look like?

  1. Likelihood function: \(L = \prod_{i=1}^n f(x_i - \mu)\)

  2. Log-likelihood: \(\ell = \sum_{i=1}^n \ln f(x_i - \mu)\)

  3. Setting the derivative to zero at \(\hat{\mu} = \bar{x}\):

\[ \sum_{i=1}^n \frac{f'(x_i - \bar{x})}{f(x_i - \bar{x})} = 0 \]

  4. For this to hold for any data, we need \(\frac{f'(t)}{f(t)} = ct\) for some constant \(c < 0\)

  5. This differential equation has the solution \(f(t) = A \, e^{ct^2/2}\)

  6. Setting \(c = -1/\sigma^2\) and normalizing gives us:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

The Gaussian is the only distribution for which the sample mean is the maximum likelihood estimator of the true mean. Gauss essentially derived it by asking: "what kind of universe would make averaging the right thing to do?"
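A quick numerical illustration of that claim: for Gaussian errors, the log-likelihood of a sample really does peak at the sample mean. The data and names below are illustrative:

```python
import math

data = [4.9, 5.3, 5.1, 4.7, 5.0]     # toy measurements
mean = sum(data) / len(data)         # sample mean = 5.0

def log_likelihood(mu, sigma=1.0):
    """Gaussian log-likelihood of the data for a candidate center mu."""
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi))
               for x in data)

# The maximum sits exactly at the sample mean:
for mu in (mean - 0.2, mean - 0.1, mean, mean + 0.1, mean + 0.2):
    print(f"mu = {mu:.1f}  log-likelihood = {log_likelihood(mu):.4f}")
```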

Variables Explained

  • \(f(x)\) — probability density: how "concentrated" the probability is at the value \(x\)
  • \(x\) — random variable: the value you're asking about
  • \(\mu\) — mean ("mu"): the center of the distribution — where the peak is
  • \(\sigma\) — standard deviation ("sigma"): how spread out the curve is — larger means wider and flatter
  • \(\sigma^2\) — variance: the square of the standard deviation
  • \(e\) — Euler's number: \(\approx 2.71828\), the base of natural logarithms
  • \(\pi\) — pi: \(\approx 3.14159\), appearing here because a circle hides in the Gaussian integral

Worked Examples

Example 1: Human Heights

Adult men's heights in a country follow \(\mu = 175\) cm, \(\sigma = 7\) cm. How dense is the probability at exactly 182 cm?

\[ f(182) = \frac{1}{7\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{182-175}{7}\right)^2} = \frac{1}{7\sqrt{2\pi}} \, e^{-\frac{1}{2}(1)^2} = \frac{1}{17.55} \cdot 0.6065 \approx 0.0346 \]

That's one standard deviation above the mean. About 3.46% probability density per cm at that point. Compare to the peak at \(\mu = 175\):

\[ f(175) = \frac{1}{7\sqrt{2\pi}} \, e^{0} = \frac{1}{17.55} \approx 0.0570 \]

So 182 cm has about 61% of the peak density — the curve hasn't fallen off too much at one sigma.
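The same arithmetic as a sanity check in Python (the `normal_pdf` helper is ours):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

peak = normal_pdf(175, 175, 7)       # density at the mean
one_sigma = normal_pdf(182, 175, 7)  # density one sigma above it
print(round(peak, 4))                # 0.057
print(round(one_sigma, 4))           # 0.0346
print(round(one_sigma / peak, 4))    # 0.6065, i.e. e^(-1/2)
```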

Example 2: The 68-95-99.7 Rule

For any normal distribution:

  • 68% of values fall within \(\mu \pm 1\sigma\)
  • 95% fall within \(\mu \pm 2\sigma\)
  • 99.7% fall within \(\mu \pm 3\sigma\)

For our height example (\(\mu = 175\), \(\sigma = 7\)):

  • 68% of men are between 168–182 cm
  • 95% are between 161–189 cm
  • 99.7% are between 154–196 cm

Someone 196 cm tall (6'5") is a three-sigma event — only 0.15% of the population. Someone 2 meters tall is beyond \(3.5\sigma\) — you'd expect roughly 1 in 4,300 people.
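The rule itself follows from the Gaussian CDF, which Python's standard library can evaluate through the error function: \(P(|X - \mu| < k\sigma) = \operatorname{erf}(k/\sqrt{2})\). A quick check:

```python
import math

# Fraction of any normal distribution within k standard deviations of the mean
for k in (1, 2, 3):
    p = math.erf(k / math.sqrt(2))
    print(f"within {k} sigma: {p:.2%}")   # 68.27%, 95.45%, 99.73%
```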

Example 3: Why Quality Control Uses "Six Sigma"

In manufacturing, a "six sigma" process is one whose nearest specification limit sits 6 standard deviations from the process mean — so producing a defect requires a \(6\sigma\) deviation. The probability of a value beyond \(6\sigma\):

\[ P(|X - \mu| > 6\sigma) \approx 2 \times 10^{-9} \approx 2 \text{ defects per billion} \]

The bell curve drops off incredibly fast. Beyond \(3\sigma\) (either side) is 1 in 370. Beyond \(4\sigma\), 1 in 15,787. Beyond \(6\sigma\), 1 in 507 million. That exponential \(e^{-x^2}\) is vicious in the tails. (Industry "Six Sigma" programs additionally allow for a \(1.5\sigma\) drift of the process mean, which is where the famous figure of 3.4 defects per million comes from.)
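These odds can be reproduced with the complementary error function, since the two-sided tail probability is \(P(|X - \mu| > k\sigma) = \operatorname{erfc}(k/\sqrt{2})\):

```python
import math

# Two-sided tail probability beyond k standard deviations
for k in (3, 4, 6):
    tail = math.erfc(k / math.sqrt(2))
    print(f"beyond {k} sigma: 1 in {1 / tail:,.0f}")
```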

Common Mistakes

  • Thinking \(f(x)\) is a probability: It's a probability density. For continuous distributions, \(P(X = \text{exactly } 182) = 0\). You need to integrate over an interval to get actual probabilities: \(P(a < X < b) = \int_a^b f(x)\,dx\).
  • Assuming everything is normal: The CLT says that averages tend to be normal. Individual data points might be wildly non-normal — incomes are skewed, earthquakes follow power laws, stock returns have fat tails. Don't slap a Gaussian on data without checking.
  • Confusing \(\sigma\) and \(\sigma^2\): The formula uses \(\sigma\) (standard deviation) in the denominator and \(\sigma^2\) (variance) in the exponent's denominator. Mixing them up changes the distribution completely.
  • Forgetting the normalizing constant: When comparing two Gaussians with different \(\sigma\), the \(\frac{1}{\sigma\sqrt{2\pi}}\) matters. A narrow Gaussian is taller at its peak than a wide one — total area must be 1, so narrower means taller.
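The first point is worth seeing in code: to get an actual probability you evaluate the CDF over an interval, not the PDF at a point. Python's `math.erf` gives the Gaussian CDF directly (the `normal_cdf` helper is ours):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# P(X = exactly 182) is zero; probability lives on intervals.
# P(168 < height < 182) for mu=175, sigma=7 (the one-sigma band):
p = normal_cdf(182, 175, 7) - normal_cdf(168, 175, 7)
print(round(p, 4))   # 0.6827
```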

History

The Gaussian distribution was discovered three separate times, by three different people, for three different reasons.

  • 1733 — Abraham de Moivre, a Huguenot refugee making a living from gambling mathematics in London's coffee houses, publishes the bell curve as an approximation to the binomial distribution. His work appears in The Doctrine of Chances. Hardly anyone notices.
  • 1774 — Pierre-Simon Laplace independently rediscovers the distribution while working on the theory of errors. He later proves the Central Limit Theorem, giving the curve its theoretical foundation — and its claim to universality.
  • 1809 — Carl Friedrich Gauss derives it from the principle that the arithmetic mean should be the best estimator. He publishes it in Theoria Motus, his masterwork on celestial mechanics. Because Gauss is Gauss, the distribution gets his name.
  • 1810 — Laplace proves the CLT more rigorously, showing that sums of many independent random variables converge to this shape. The bell curve goes from "convenient approximation" to "fundamental law of probability."
  • 1835 — Adolphe Quetelet, a Belgian astronomer turned social scientist, applies the normal distribution to human characteristics — height, chest circumference, even crime rates. He calls his idealized average person l'homme moyen ("the average man"). It's the birth of social statistics, and also the birth of some deeply problematic ideas about "normal" people.
  • 1893 — Karl Pearson coins the name "standard deviation" and formalizes much of the mathematical machinery around the normal distribution.
  • 1920s–30s — Ronald Fisher builds the modern framework of statistical inference on a Gaussian foundation — ANOVA, maximum likelihood, sufficiency — cementing the normal distribution's place at the center of 20th-century science.

What started as a gambling trick in a London coffee house became the most important probability distribution in mathematics. Not bad for a curve that a refugee scribbled in the margins of a book about dice.

References

  • de Moivre, A. (1738). The Doctrine of Chances, 2nd edition.
  • Gauss, C. F. (1809). Theoria Motus Corporum Coelestium.
  • Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900.
  • Stahl, S. (2006). "The evolution of the normal distribution." Mathematics Magazine.
  • Feynman, R. P. The Feynman Lectures on Physics, Vol. 1, Ch. 6.