Confidence Interval¶
The Formula¶
For the mean of a population:

$$\bar{x} \pm t_{n-1,\,\alpha/2}\,\frac{s}{\sqrt{n}}$$
Or more generally, for any estimator \(\hat{\theta}\):

$$\hat{\theta} \pm c_{\alpha/2} \cdot \text{SE}(\hat{\theta})$$

where \(c_{\alpha/2}\) is the critical value for the chosen confidence level and \(\text{SE}(\hat{\theta})\) is the estimator's standard error.
What It Means¶
You measure something — say the average height of students at your university. You survey 50 people and get \(\bar{x} = 172\) cm. But you know that if you'd picked a different 50 people, you'd have gotten a slightly different number. So how close is 172 to the true average?
A 95% confidence interval says: "We computed an interval from this data. If we repeated this entire procedure many times — new sample, new interval, every time — then 95% of those intervals would contain the true value."
It's not saying there's a 95% probability the true value is in this particular interval. The true value is fixed — it either is in there or it isn't. What's random is the interval itself. This subtlety drove statisticians crazy for decades, and it still trips people up.
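The "repeated procedure" claim can be checked directly by simulation. The sketch below (standard-library Python; the true mean of 172 cm and SD of 10 cm are illustrative assumptions, and we pretend \(\sigma\) is known so the 1.96 z-multiplier applies) builds thousands of intervals and counts how many trap the true mean:

```python
# Coverage simulation: build many 95% intervals, count how many cover mu.
# Assumptions (not from the text): true mean 172, true SD 10, sigma known.
import random

random.seed(0)
TRUE_MU, SIGMA, N, TRIALS = 172.0, 10.0, 50, 2000
covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MU, SIGMA) for _ in range(N)]
    xbar = sum(sample) / N
    half = 1.96 * SIGMA / N ** 0.5          # half-width of the 95% z-interval
    if xbar - half <= TRUE_MU <= xbar + half:
        covered += 1                         # this interval trapped the true mean

coverage = covered / TRIALS
print(f"empirical coverage: {coverage:.3f}")
```

Each individual interval either covers 172 or it doesn't; it's the long-run fraction that sits near 0.95.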
Why It Works — The Story Behind the Formula¶
The Core Idea: Inverting a Test¶
The confidence interval was born from a clever trick: flip a probability statement about the estimator into a statement about the parameter.
Here's the logic chain, step by step.
Step 1: The sampling distribution of \(\bar{x}\)
We know from the Central Limit Theorem that for large \(n\) (or if the population is normal):

$$\bar{x} \sim N\!\left(\mu,\ \frac{\sigma^2}{n}\right)$$
This says: \(\bar{x}\) is a random variable centered on the true mean \(\mu\), with spread \(\sigma/\sqrt{n}\).
Step 2: Standardize it
Subtract the mean and divide by the standard deviation to get a standard normal:

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
This is just a change of units — now we're measuring "how many standard errors is \(\bar{x}\) away from \(\mu\)?"
Step 3: Write a probability statement
For a 95% interval, we need the middle 95% of the standard normal. That's between \(-1.96\) and \(+1.96\):

$$P\!\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$
This is a true probability statement — before we collect data, there's a 95% chance that \(\bar{x}\) will land within 1.96 standard errors of \(\mu\).
Step 4: The inversion trick — solve for \(\mu\)
Now rearrange the inequality. Multiply everything by \(\sigma/\sqrt{n}\):

$$P\!\left(-1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{x} - \mu \le 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
Subtract \(\bar{x}\) from all parts and multiply by \(-1\) (which flips the inequalities):

$$P\!\left(\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
And there it is. The confidence interval:

$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$$
Notice what happened: we started with a probability about \(\bar{x}\) (random) relative to \(\mu\) (fixed), and by rearranging, we got an interval around \(\bar{x}\) that traps \(\mu\). The probability still refers to the procedure, not to \(\mu\) — because \(\mu\) never moved. We just rearranged who's in the middle of the inequality.
But We Don't Know \(\sigma\) — Enter Gosset¶
In the real world, you almost never know \(\sigma\). You estimate it with the sample standard deviation \(s\). But plugging in \(s\) for \(\sigma\) introduces extra uncertainty — you're now uncertain about two things (the mean and the spread).
William Sealy Gosset, the Guinness brewery chemist, solved this in 1908. When you replace \(\sigma\) with \(s\), the standardized statistic:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
no longer follows a normal distribution. It follows a \(t\)-distribution with \(n - 1\) degrees of freedom — a distribution that's wider than the normal, with heavier tails, especially when \(n\) is small.
Why \(n - 1\) degrees of freedom? Because \(s\) is computed from the same data as \(\bar{x}\). The deviations \(x_i - \bar{x}\) are forced to sum to zero (that's what "deviation from the mean" means), so only \(n - 1\) of them are free to vary. You "spent" one degree of freedom estimating \(\bar{x}\).
The confidence interval becomes:

$$\bar{x} \pm t_{n-1,\,\alpha/2}\,\frac{s}{\sqrt{n}}$$
where \(t_{n-1,\, \alpha/2}\) is the critical value from the \(t\)-distribution. For large \(n\), the \(t\)-distribution converges to the normal (\(t_{n-1} \to z\) as \(n \to \infty\)), and the distinction vanishes.
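This interval is straightforward to compute from summary statistics. The helper below (a hypothetical name, assuming `scipy` is available for the \(t\) quantile) implements the formula above and checks it against the battery-life numbers used later in this page:

```python
# Hypothetical helper: t-interval for the mean from summary statistics.
from math import sqrt
from scipy.stats import t

def t_interval(xbar: float, s: float, n: int, conf: float = 0.95):
    """Two-sided confidence interval for the mean, sigma unknown."""
    alpha = 1 - conf
    crit = t.ppf(1 - alpha / 2, df=n - 1)   # t_{n-1, alpha/2}
    half = crit * s / sqrt(n)               # margin of error
    return xbar - half, xbar + half

# Battery-life numbers from the worked examples: n=8, xbar=48.2, s=3.6
lo, hi = t_interval(48.2, 3.6, 8)
print(f"95% CI: ({lo:.1f}, {hi:.1f})")      # (45.2, 51.2)
```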
Dissecting the Width¶
The half-width of the interval (the margin of error) is:

$$t_{n-1,\,\alpha/2}\,\frac{s}{\sqrt{n}}$$
Three levers control how wide the interval is:
\(t_{n-1,\, \alpha/2}\) (the confidence level): Higher confidence = wider interval. A 99% CI uses a bigger multiplier (\(\approx 2.58\)) than a 95% CI (\(\approx 1.96\)). You can't get more confidence without accepting more vagueness. At 100% confidence, the interval would be \((-\infty, +\infty)\) — completely useless but technically correct.
\(s\) (the data variability): Messier data = wider interval. You inherit the noise of the world. Nothing you can do about this except design better experiments.
\(\sqrt{n}\) (the sample size): More data = narrower interval, but with diminishing returns. The \(\sqrt{n}\) in the denominator means doubling your precision requires quadrupling your sample size. Going from \(n = 100\) to \(n = 400\) cuts the margin of error in half. Going from \(n = 400\) to \(n = 1600\) halves it again.
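The \(\sqrt{n}\) law is easy to see numerically. A minimal sketch (using the 1.96 z-multiplier as an approximation, which ignores the small shrinking-\(t\) correction):

```python
# Margin of error ~ z * s / sqrt(n): quadrupling n halves the margin exactly
# (under the fixed-multiplier approximation).
def margin_of_error(s: float, n: int, z: float = 1.96) -> float:
    return z * s / n ** 0.5

me_100 = margin_of_error(12, 100)
me_400 = margin_of_error(12, 400)
print(me_100 / me_400)  # 2.0
```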
Step-by-Step Derivation (Self-Contained)¶
Setup and Assumptions¶
We assume \(X_1, X_2, \ldots, X_n\) are independent and identically distributed (i.i.d.) from a normal population \(N(\mu, \sigma^2)\), where \(\mu\) is unknown.
The Known-\(\sigma\) Case (Z-interval)¶
1. The sample mean is:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
2. Since each \(X_i \sim N(\mu, \sigma^2)\) and they're independent:

$$E(\bar{X}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\cdot n\mu = \mu$$
Here we used the fact that expectation is linear: \(E(aX) = aE(X)\), and \(E(X_1 + X_2) = E(X_1) + E(X_2)\) always, regardless of independence.
3. For the variance:

$$\text{Var}(\bar{X}) = \text{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \text{Var}(X_i) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}$$
The \(\frac{1}{n}\) became \(\frac{1}{n^2}\) because \(\text{Var}(aX) = a^2 \text{Var}(X)\) — constants come out squared from variance (unlike expectation, where they come out linearly). And we can split the sum because the \(X_i\) are independent — for independent variables, \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) (no covariance term).
4. Since a linear combination of normals is normal:

$$\bar{X} \sim N\!\left(\mu,\ \frac{\sigma^2}{n}\right)$$
5. Standardize — subtract the mean, divide by the standard deviation:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
The numerator has mean 0 (we subtracted \(\mu\)). The denominator is the standard deviation of \(\bar{X}\) (which is \(\sigma/\sqrt{n}\)). So \(Z\) has mean 0 and variance 1.
6. Write the probability statement using the \(z_{\alpha/2}\) quantile:

$$P\!\left(-z_{\alpha/2} \le Z \le z_{\alpha/2}\right) = 1 - \alpha$$
7. Substitute back \(Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\) and rearrange for \(\mu\):

$$P\!\left(\bar{X} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
The Unknown-\(\sigma\) Case (t-interval)¶
8. Replace \(\sigma\) with the sample standard deviation:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad s = \sqrt{s^2}$$
We divide by \(n - 1\) (not \(n\)) because the deviations \(X_i - \bar{X}\) are constrained: they must sum to zero (\(\sum(X_i - \bar{X}) = 0\) by definition of \(\bar{X}\)). This constraint removes one degree of freedom, so only \(n - 1\) deviations are "free." Dividing by \(n - 1\) corrects for this and makes \(s^2\) an unbiased estimator of \(\sigma^2\).
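The unbiasedness claim can be verified by simulation. A sketch using the standard library (the population parameters \(\mu = 0\), \(\sigma^2 = 4\) and sample size 5 are illustrative assumptions): `statistics.variance` divides by \(n-1\), while `statistics.pvariance` applied to a sample divides by \(n\) and lands low by the factor \((n-1)/n\).

```python
# Averaging many sample variances: the n-1 divisor centers on sigma^2 = 4;
# the n divisor is biased low by (n-1)/n = 4/5, so it centers on 3.2.
import random
import statistics

random.seed(1)
SIGMA2, N, TRIALS = 4.0, 5, 20000
unbiased, biased = 0.0, 0.0
for _ in range(TRIALS):
    sample = [random.gauss(0.0, SIGMA2 ** 0.5) for _ in range(N)]
    unbiased += statistics.variance(sample)   # divides by n - 1
    biased += statistics.pvariance(sample)    # divides by n
unbiased /= TRIALS
biased /= TRIALS
print(f"n-1 divisor averages {unbiased:.2f}  (target 4.00)")
print(f"n   divisor averages {biased:.2f}  (target 3.20)")
```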
9. The ratio:

$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$
now follows a \(t\)-distribution with \(n - 1\) degrees of freedom. This is not normal because \(s\) is itself random — it fluctuates from sample to sample. The \(t\)-distribution accounts for this extra variability by having heavier tails (more probability in the extremes). When \(n\) is large, \(s \approx \sigma\) with high probability, the extra variability vanishes, and \(t_{n-1} \to N(0,1)\).
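The convergence \(t_{n-1} \to N(0,1)\) shows up directly in the critical values. A quick check (assuming `scipy` is available):

```python
# 97.5th-percentile critical values: t marches down toward the normal's 1.96
# as the degrees of freedom grow.
from scipy.stats import norm, t

z = norm.ppf(0.975)                     # ~1.96
crits = {df: t.ppf(0.975, df) for df in (2, 5, 10, 30, 100, 1000)}
for df, c in crits.items():
    print(f"df={df:>4}: t = {c:.3f}")
print(f"normal   : z = {z:.3f}")
```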
10. The confidence interval:

$$\bar{X} \pm t_{n-1,\,\alpha/2}\,\frac{s}{\sqrt{n}}$$
Variables Explained¶
| Symbol | Name | Description |
|---|---|---|
| \(\bar{x}\) | Sample mean | The center of the interval |
| \(\mu\) | Population mean | The true value we're trying to trap |
| \(s\) | Sample standard deviation | Our estimate of the population spread |
| \(n\) | Sample size | Number of observations |
| \(t_{n-1,\, \alpha/2}\) | Critical value | The multiplier from the \(t\)-distribution |
| \(1 - \alpha\) | Confidence level | Typically 0.95 (95%) or 0.99 (99%) |
| \(\text{SE}\) | Standard error | \(s / \sqrt{n}\), the estimated standard deviation of \(\bar{x}\) |
Worked Examples¶
Example 1: Average Commute Time¶
You survey \(n = 36\) commuters. Mean commute: \(\bar{x} = 42\) minutes. Sample SD: \(s = 12\) minutes. Build a 95% CI.
Since \(n = 36\) is reasonably large, \(t_{35,\, 0.025} \approx 2.03\):

$$42 \pm 2.03 \times \frac{12}{\sqrt{36}} = 42 \pm 2.03 \times 2 = 42 \pm 4.06$$
95% CI: (37.9, 46.1) minutes. We're 95% confident — in the coverage sense above — that the true average commute is between 38 and 46 minutes.
Example 2: Small Sample — Battery Life¶
You test \(n = 8\) batteries. Mean life: \(\bar{x} = 48.2\) hours. \(s = 3.6\) hours. 95% CI.
With \(n - 1 = 7\) degrees of freedom, \(t_{7,\, 0.025} = 2.365\) (noticeably larger than 1.96 — small samples pay a penalty):

$$48.2 \pm 2.365 \times \frac{3.6}{\sqrt{8}} = 48.2 \pm 2.365 \times 1.273 = 48.2 \pm 3.01$$
95% CI: (45.2, 51.2) hours. Wider than you'd get with 36 batteries — both because \(n\) is small (larger SE) and because \(t_7 > t_{35}\) (larger multiplier).
Example 3: How Sample Size Affects Width¶
Same data (\(s = 12\), 95% CI), varying \(n\):
| \(n\) | SE | \(t\) critical | Margin of Error |
|---|---|---|---|
| 9 | 4.00 | 2.306 | 9.22 |
| 36 | 2.00 | 2.030 | 4.06 |
| 144 | 1.00 | 1.977 | 1.98 |
| 576 | 0.50 | 1.964 | 0.98 |
Each 4x increase in \(n\) roughly halves the margin of error (the SE halves exactly). The \(t\) critical value also shrinks toward 1.96 as \(n\) grows — a double benefit of larger samples.
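The table above is easy to reproduce (assuming `scipy` for the \(t\) quantiles):

```python
# SE = s/sqrt(n), ME = t_{n-1, 0.025} * SE, for s = 12 at 95% confidence.
from math import sqrt
from scipy.stats import t

s = 12.0
rows = []
for n in (9, 36, 144, 576):
    se = s / sqrt(n)
    crit = t.ppf(0.975, n - 1)
    rows.append((n, se, crit, crit * se))
    print(f"n={n:>3}  SE={se:.2f}  t={crit:.3f}  ME={crit * se:.2f}")
```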
Common Mistakes¶
- "There's a 95% probability \(\mu\) is in this interval": No. \(\mu\) is fixed. The interval is random. Before collecting data, there's a 95% chance the procedure produces an interval that covers \(\mu\). After you compute the interval, it either covers \(\mu\) or it doesn't — you just don't know which.
- Thinking wider = worse: A wider interval is more honest, not worse. If your data is noisy or your sample is small, a narrow interval would be lying about your uncertainty.
- Using \(z\) instead of \(t\) with small samples: When \(n < 30\) and \(\sigma\) is unknown (it almost always is), the \(z\)-interval is too narrow. The \(t\)-distribution corrects for the extra uncertainty in estimating \(\sigma\).
- Ignoring assumptions: The CI assumes independent observations and (for small \(n\)) approximate normality. If your data is heavily skewed or has clusters, the interval may not achieve its stated coverage.
- Confusing confidence interval with prediction interval: A CI estimates where the mean is. A prediction interval estimates where the next observation will be. The latter is always wider.
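The CI-vs-prediction-interval gap in that last bullet is worth seeing numerically. A sketch using the standard normal-theory PI half-width \(t_{n-1,\,\alpha/2}\, s \sqrt{1 + 1/n}\) (a standard formula, but not derived on this page), with the commute-example numbers and `scipy` assumed available:

```python
# CI half-width vs prediction-interval half-width for the same data:
# the PI must cover a whole new observation, not just the mean, so it is
# wider by roughly a factor of sqrt(n).
from math import sqrt
from scipy.stats import t

s, n = 12.0, 36                       # commute-example numbers
crit = t.ppf(0.975, n - 1)
ci_half = crit * s / sqrt(n)          # where the MEAN is
pi_half = crit * s * sqrt(1 + 1 / n)  # where the NEXT OBSERVATION is
print(f"CI half-width: {ci_half:.1f}   PI half-width: {pi_half:.1f}")
```

More data shrinks the CI toward zero width, but the PI can never shrink below \(\pm t \cdot s\): the world's noise doesn't go away.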
Related Formulas¶
- Standard Error — the \(s/\sqrt{n}\) that forms the core of the CI
- Prediction Interval — for individual observations, not means
- Gaussian Distribution — the distributional foundation
- SLR: Mean Response and Prediction — CIs in the regression context
History¶
- 1908 — William Sealy Gosset (as "Student") derives the \(t\)-distribution. He's working with samples of 4–10 barley batches at Guinness, far too small for the normal approximation. His solution: account for the randomness in \(s\) itself. He didn't call it a confidence interval — that language came later.
- 1930s — Jerzy Neyman formalizes the theory of confidence intervals in a landmark 1937 paper. He defines the coverage probability interpretation: the interval is a random object that, across repeated sampling, covers the true parameter with the stated probability. This was controversial — Ronald Fisher hated it, preferring his own "fiducial inference" approach. The Neyman-Fisher feud over the correct interpretation of interval estimates lasted decades.
- 1937 — Neyman presents his framework at a Royal Statistical Society meeting. In the discussion, Arthur Bowley reportedly asks: "Does this just mean that we should use wider intervals?" Neyman's reply — that the width depends on the desired confidence level and the data — essentially defines modern interval estimation.
References¶
- Neyman, J. (1937). "Outline of a theory of statistical estimation based on the classical theory of probability." Philosophical Transactions of the Royal Society.
- Student (Gosset, W. S.) (1908). "The probable error of a mean." Biometrika.
- Morey, R. D., et al. (2016). "The fallacy of placing confidence in confidence intervals." Psychonomic Bulletin & Review. (A great paper on what CIs actually mean.)