Standard Error

The Formula

\[ SE = \frac{\sigma}{\sqrt{n}} \]

Or, when you don't know the population standard deviation (which is almost always):

\[ SE = \frac{s}{\sqrt{n}} \]

What It Means

You run a survey of 100 people and find their average height is 172 cm. Cool. But if you grabbed a different 100 people, you'd get a slightly different number — maybe 171.3, maybe 173.1. The standard error tells you how much that average would jump around if you kept repeating the experiment.

It's not about how spread out individuals are (that's the standard deviation). It's about how much your sample mean wobbles. A subtle but crucial difference — and confusing the two is one of the most common mistakes in statistics.
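This "wobble of the mean" can be checked by simulation. The sketch below (hypothetical population parameters: heights from a Normal(172, 7) distribution) repeats the 100-person survey thousands of times and measures how much the sample mean itself varies:

```python
import random
import statistics

random.seed(42)

# Hypothetical population: heights drawn from Normal(mean=172, sd=7) cm.
POP_MEAN, POP_SD, N = 172, 7, 100

# Repeat the "survey 100 people" experiment 5,000 times, recording each mean.
means = [
    statistics.fmean(random.gauss(POP_MEAN, POP_SD) for _ in range(N))
    for _ in range(5000)
]

# The spread of those 5,000 sample means should sit near sigma/sqrt(n) = 0.7 cm,
# far smaller than the 7 cm spread of individual heights.
print(round(statistics.stdev(means), 2))
```

The printed value lands near 0.7, an order of magnitude below the individual standard deviation of 7 — exactly the SD-versus-SE distinction the paragraph above describes.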

Why It Works — The Story Behind the Formula

The Problem That Started It All

Imagine it's the early 1800s. Astronomers are trying to measure the position of a star. Every measurement they take gives a slightly different answer — instruments shake, eyes blur, the atmosphere ripples. They know the true position is somewhere in that cloud of measurements, but where exactly?

They had a brilliant idea: take the average. Everyone knew that averaging measurements seemed to "cancel out" errors. But nobody could say precisely how good that average was. If you take 10 measurements versus 100, your average should be better — but how much better?

This is the question that haunted astronomers, and it was eventually answered by a chain of thinkers from Gauss to Laplace to the modern era of statistics.

Why \(\sigma\) Is in the Numerator

This part is intuitive. If your individual measurements are all over the place (large \(\sigma\)), then your average is going to be shaky too. If every measurement is nearly identical (small \(\sigma\)), the average is rock solid. The standard error inherits the messiness of the raw data — it has to.

Why \(\sqrt{n}\) Is in the Denominator (Not \(n\))

This is the part that surprises people. You'd think that 100 measurements should be 100 times better than 1. But they're not. They're only \(\sqrt{100} = 10\) times better. Why?

Here's the intuition. When you average \(n\) measurements, each one contributes a little random error. If these errors were all in the same direction, they'd add up to \(n\) times the individual error, and averaging would do nothing. But errors are random — some positive, some negative — so they partially cancel out.

How much do they cancel? This is where the math gets beautiful. If each measurement has variance \(\sigma^2\), then the sum of \(n\) independent measurements has variance \(n\sigma^2\) (variances add for independent variables). The average divides by \(n\), so:

\[ \text{Var}(\bar{x}) = \text{Var}\left(\frac{\sum x_i}{n}\right) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \]

Take the square root to get back to standard deviation units:

\[ SE = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]

The \(\sqrt{n}\) appears because we're working with variances (squared quantities), which add linearly, and then taking a square root at the end. It's the same reason the hypotenuse of a right triangle isn't the sum of the two sides — randomness follows Pythagorean-style addition, not simple addition.
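The Pythagorean claim is easy to verify numerically. A minimal sketch, using two made-up independent error sources with standard deviations 3 and 4: their sum's spread should be \(\sqrt{3^2 + 4^2} = 5\), not \(3 + 4 = 7\).

```python
import math
import random
import statistics

random.seed(0)

# Two independent error sources with standard deviations 3 and 4 (made-up values).
a = [random.gauss(0, 3) for _ in range(100_000)]
b = [random.gauss(0, 4) for _ in range(100_000)]

# Variances add for independent variables, so the combined spread follows
# the Pythagorean rule: sqrt(3^2 + 4^2) = 5, not the plain sum 3 + 4 = 7.
total_sd = statistics.stdev(x + y for x, y in zip(a, b))

print(round(total_sd, 1))   # close to math.hypot(3, 4) = 5.0
```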

The Random Walk Analogy

Think of a drunk person stumbling randomly — one step left, one step right, completely unpredictable. After \(n\) steps, how far are they from where they started?

Not \(n\) steps away (that would mean they always stumble in the same direction). Typically, they're only about \(\sqrt{n}\) steps away. After 100 random steps, they've drifted about 10 steps from the start.

Your sample mean is doing exactly this, but in reverse. Each measurement adds a random "step" of error. After \(n\) measurements, the accumulated error in the average grows like \(\sqrt{n}\) — but you're dividing by \(n\), so the error in the mean shrinks like \(\frac{1}{\sqrt{n}}\).

This is why the standard error has \(\sqrt{n}\) in the denominator. It's the signature of random cancellation.
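The drunk walker can be simulated directly. This sketch estimates the root-mean-square distance from the origin after \(n\) random \(\pm 1\) steps; theory says it equals exactly \(\sqrt{n}\):

```python
import math
import random
import statistics

random.seed(7)

def rms_distance(n_steps, walks=4000):
    """Root-mean-square distance from the origin after n random +/-1 steps."""
    sq_dists = [
        sum(random.choice((-1, 1)) for _ in range(n_steps)) ** 2
        for _ in range(walks)
    ]
    return math.sqrt(statistics.fmean(sq_dists))

# RMS distance grows like sqrt(n): roughly 5, 10, 20 for n = 25, 100, 400.
for n in (25, 100, 400):
    print(n, round(rms_distance(n), 1))
```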

The Law of Diminishing Returns

This \(\sqrt{n}\) business has a profound practical consequence: there are diminishing returns to collecting more data.

Going from 1 to 4 measurements cuts your error in half. Going from 4 to 16 cuts it in half again. But going from 100 to 200? That only reduces your error by about 30%. To halve your error from where you are now, you always need four times as much data.

This is why pharmaceutical trials cost so much. This is why pollsters can't just "survey more people" to get perfect predictions. The \(\sqrt{n}\) is a fundamental speed limit on how fast certainty grows.
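The "four times as much data" rule follows directly from inverting the formula: if \(SE \propto 1/\sqrt{n}\), then shrinking the error by a factor \(k\) requires \(k^2\) times the sample size. A one-line helper (name and numbers are illustrative):

```python
import math

def n_needed(current_n, precision_gain):
    """Sample size needed to shrink the standard error by a given factor.

    Since SE scales as 1/sqrt(n), a k-fold smaller SE costs k^2-fold more data.
    """
    return math.ceil(current_n * precision_gain ** 2)

print(n_needed(100, 2))    # halving the error: 400 samples, not 200
print(n_needed(1000, 2))   # 4000
print(n_needed(100, 10))   # a 10x sharper estimate needs 10,000 samples
```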

Step-by-Step Derivation

Starting from first principles:

  1. Assume \(n\) independent measurements \(x_1, x_2, \ldots, x_n\), each from the same distribution with mean \(\mu\) and variance \(\sigma^2\)

  2. The sample mean is:

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]

  3. Variance of the mean — since the \(x_i\) are independent:

\[ \text{Var}(\bar{x}) = \text{Var}\left(\frac{1}{n}\sum x_i\right) = \frac{1}{n^2}\sum \text{Var}(x_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n} \]

  4. Standard error is the square root of this variance:

\[ SE = \sqrt{\text{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}} \]

  5. In practice, we don't know \(\sigma\), so we estimate it with the sample standard deviation \(s\):

\[ SE \approx \frac{s}{\sqrt{n}} \]
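The final practical formula, step 5, translates to just a few lines of code. A minimal sketch with made-up measurements (the `standard_error` helper is illustrative, not a library function):

```python
import math
import statistics

def standard_error(sample):
    """SE of the mean: sample standard deviation (ddof=1) over sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Made-up measurements for illustration.
scores = [68, 75, 81, 70, 66, 79, 73, 77]
mean = statistics.fmean(scores)
print(f"{mean:.2f} +/- {standard_error(scores):.2f}")   # 73.62 +/- 1.89
```

Note that `statistics.stdev` already uses the \(n-1\) denominator, so this matches the sample standard deviation \(s\) in the derivation above.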

Variables Explained

Symbol       Name                            Description
\(SE\)       Standard error                  How much the sample mean would vary across repeated samples
\(\sigma\)   Population standard deviation   The true spread of individual values (usually unknown)
\(s\)        Sample standard deviation       Our estimate of \(\sigma\) from the data we have
\(n\)        Sample size                     Number of observations in the sample
\(\bar{x}\)  Sample mean                     The average of the observed values

Worked Examples

Example 1: Exam Scores

A professor samples 25 students' exam scores. The sample standard deviation is \(s = 15\) points.

\[ SE = \frac{15}{\sqrt{25}} = \frac{15}{5} = 3 \text{ points} \]

If the sample mean was 72, the true class average is probably within a few points of 72 (roughly \(\pm 6\), or two standard errors). If she'd only sampled 4 students, the SE would be \(\frac{15}{\sqrt{4}} = 7.5\) — much shakier.

Example 2: The Polling Problem

A political poll surveys 1,000 people and finds 52% support a candidate. Support is binary (yes/no), and a 0/1 variable with proportion \(p\) has standard deviation \(\sqrt{p(1-p)}\), so \(s \approx \sqrt{0.52 \times 0.48} \approx 0.5\).

\[ SE = \frac{0.5}{\sqrt{1000}} \approx 0.016 = 1.6\% \]

That's the famous "margin of error" you hear about — roughly \(\pm 3.2\%\) (two standard errors). With 52% support and a 3.2% margin, the race is a toss-up. To cut that margin in half, you'd need 4,000 respondents — not 2,000.
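The margin-of-error arithmetic above can be packaged as a tiny helper (the function name is illustrative; \(z = 2\) approximates the conventional 95% confidence level):

```python
import math

def margin_of_error(p, n, z=2):
    """Approximate margin of error for a polled proportion (z=2 ~ 95%)."""
    return z * math.sqrt(p * (1 - p) / n)

print(round(100 * margin_of_error(0.52, 1000), 1))   # 3.2 (percent)
print(round(100 * margin_of_error(0.52, 4000), 1))   # 1.6 -- 4x the sample, half the margin
```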

Example 3: The Diminishing Returns

A lab measures the boiling point of a liquid. With \(s = 0.8°C\):

\(n\)   \(SE\)    Improvement
4       0.40°C    (baseline)
16      0.20°C    2× better
64      0.10°C    4× better
256     0.05°C    8× better

Each 2× improvement in precision costs 4× the data. Nature's tax on certainty.

Common Mistakes

  • Confusing SE with standard deviation: The standard deviation (\(\sigma\) or \(s\)) describes how spread out individuals are. The standard error describes how uncertain the mean is. They answer different questions. The SE is always smaller — often much smaller.
  • Thinking more data always helps a lot: Because of the \(\sqrt{n}\), each additional data point helps less than the last one. Going from 10 to 11 samples barely changes anything. Going from 1 to 2 changes a lot.
  • Forgetting independence: The whole derivation assumes measurements are independent. If you survey 100 people at the same party, you don't really have 100 independent data points. The formula will make you overconfident.
  • Using SE when you mean SD (and vice versa): Plotting error bars? If you want to show how variable individuals are, use SD. If you want to show how precisely you know the mean, use SE. Many papers get this wrong.
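The SD-versus-SE error-bar distinction in the last bullet is concrete in code. A sketch with hypothetical replicate measurements, showing how different the two bar widths are:

```python
import math
import statistics

# Hypothetical replicate measurements of the same quantity.
data = [4.8, 5.1, 5.0, 4.7, 5.3, 4.9, 5.2, 5.0]

sd = statistics.stdev(data)        # spread of individual measurements
se = sd / math.sqrt(len(data))     # uncertainty of the mean itself

print(f"SD bars: +/- {sd:.3f}")    # use to show individual variability
print(f"SE bars: +/- {se:.3f}")    # use to show precision of the mean
```

Here the SE bars are \(\sqrt{8} \approx 2.8\) times narrower than the SD bars; picking the wrong one silently overstates or understates the uncertainty being plotted.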

History

  • 1733 — Abraham de Moivre discovers the normal distribution, which underpins the entire theory
  • 1809 — Carl Friedrich Gauss publishes his method of least squares and the "Gaussian" error curve. He shows that the mean minimizes squared errors — but doesn't quite formalize the standard error
  • 1812 — Pierre-Simon Laplace proves the Central Limit Theorem: the average of many measurements tends toward a normal distribution, no matter what the individual measurements look like. This is the theoretical engine that makes the standard error meaningful
  • 1908 — William Sealy Gosset, a chemist at the Guinness brewery in Dublin, faces a problem: he's working with tiny samples (5-10 batches of barley), and the standard error formula pretends you know \(\sigma\) exactly. He doesn't — he only has \(s\). Under the pen name "Student" (Guinness didn't allow employees to publish), he derives the t-distribution, which accounts for the extra uncertainty when \(\sigma\) is estimated. The modern standard error formula with \(s\) instead of \(\sigma\) is really Gosset's contribution
  • 1930s — Jerzy Neyman and Egon Pearson build confidence intervals on top of the standard error, giving it the central role it plays in modern statistics

The story of the standard error is really the story of learning to be honest about uncertainty — from astronomers who just wanted to average their measurements, to a brewer who realized that small samples need special care.

References

  • Stigler, Stephen M. The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, 1986.
  • Student (Gosset, W. S.). "The probable error of a mean." Biometrika, 1908.
  • Feynman, R. P. The Feynman Lectures on Physics, Vol. 1, Ch. 6 (Probability).