Prediction Interval¶
The Formula¶
For predicting a single new observation from a normal population:

\[\bar{x} \pm t_{n-1,\,\alpha/2}\; s\sqrt{1 + \frac{1}{n}}\]

For predicting a new observation at \(x_h\) in simple linear regression:

\[\hat{Y}_h \pm t_{n-2,\,\alpha/2}\; s\sqrt{1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}}\]
What It Means¶
A confidence interval answers: "Where is the true average?"
A prediction interval answers a different question: "Where will the next individual observation land?"
Even if you knew the true mean exactly — no uncertainty at all — individual observations would still scatter around it. People aren't all the same height. Batteries don't all last the same number of hours. The prediction interval captures both your uncertainty about the mean and this irreducible individual variation.
That's why a prediction interval is always wider than a confidence interval. Always. No exceptions. And unlike a confidence interval, it never shrinks to zero no matter how much data you collect — because individual randomness doesn't go away.
Why It Works — The Story Behind the Formula¶
The Key Insight: Two Sources of Uncertainty¶
Suppose you've collected \(n\) observations and now want to predict a new one, \(X_{\text{new}}\). Your best guess is \(\bar{x}\), the sample mean. The prediction error is:

\[X_{\text{new}} - \bar{X}\]
What's the variance of this error? Here's where it gets interesting. \(X_{\text{new}}\) and \(\bar{X}\) are independent — the new observation hasn't been collected yet, so it can't possibly be correlated with the data you already have. This independence is crucial, because it means:

\[\text{Var}(X_{\text{new}} - \bar{X}) = \text{Var}(X_{\text{new}}) + \text{Var}(\bar{X})\]
No cross-term. No covariance to worry about. The variances just add.
If the variables were not independent, we'd need \(\text{Var}(A - B) = \text{Var}(A) + \text{Var}(B) - 2\text{Cov}(A, B)\). But \(\text{Cov}(X_{\text{new}}, \bar{X}) = 0\) because \(X_{\text{new}}\) is a fresh draw, independent of the training data. So the covariance term vanishes.
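The variance addition can be checked empirically. Below is a small Monte Carlo sketch (not part of the original text; the population parameters are illustrative): draw a sample, compute its mean, draw one fresh observation, and look at the variance of the prediction error. It should come out near \(\sigma^2(1 + 1/n)\).

```python
import random

# Monte Carlo check: Var(X_new - Xbar) should equal sigma^2 * (1 + 1/n),
# because the fresh draw is independent of the sample, so variances add.
random.seed(0)
mu, sigma, n, reps = 50.0, 4.0, 16, 200_000

errors = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    x_new = random.gauss(mu, sigma)  # fresh, independent draw
    errors.append(x_new - xbar)

# The error has mean ~0 (unbiased), so the mean of squares estimates its variance.
emp_var = sum(e * e for e in errors) / reps
theory = sigma**2 * (1 + 1 / n)  # 16 * 1.0625 = 17.0
print(emp_var, theory)
```

With 200,000 replications the empirical variance lands within a few hundredths of the theoretical 17.0.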
Full Derivation — The Mean Case¶
Step 1: Set up the prediction error
We want to predict \(X_{\text{new}} \sim N(\mu, \sigma^2)\) using our sample mean \(\bar{X}\). The error is:

\[X_{\text{new}} - \bar{X}\]
Step 2: Find the expected value of the error

\[E(X_{\text{new}} - \bar{X}) = E(X_{\text{new}}) - E(\bar{X}) = \mu - \mu = 0\]

We can split the expectation because \(E(A - B) = E(A) - E(B)\) always (linearity of expectation — no assumptions needed).
Good — our prediction is unbiased. On average, we're dead on.
Step 3: Find the variance of the error

\[\text{Var}(X_{\text{new}} - \bar{X}) = \text{Var}(X_{\text{new}}) + \text{Var}(\bar{X})\]
The variances add (not subtract) because \(\text{Var}(A - B) = \text{Var}(A) + \text{Var}(B)\) when \(A\) and \(B\) are independent. The minus sign in \(A - B\) becomes a plus in the variance — variance doesn't care about direction, only magnitude. Formally: \(\text{Var}(-B) = (-1)^2 \text{Var}(B) = \text{Var}(B)\).
Now plug in:

\[\text{Var}(X_{\text{new}}) = \sigma^2\]

This is just the variance of a single observation from the population — it's what \(\sigma^2\) means.

\[\text{Var}(\bar{X}) = \frac{\sigma^2}{n}\]

This is the standard error squared. We derived this elsewhere: \(\text{Var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}\).

So:

\[\text{Var}(X_{\text{new}} - \bar{X}) = \sigma^2 + \frac{\sigma^2}{n} = \sigma^2\left(1 + \frac{1}{n}\right)\]
Step 4: The standard deviation of the error

\[\text{SD}(X_{\text{new}} - \bar{X}) = \sigma\sqrt{1 + \frac{1}{n}}\]
Step 5: Standardize

Since both \(X_{\text{new}}\) and \(\bar{X}\) are normal (a difference of independent normals is normal), standardizing the error and replacing \(\sigma\) with \(s\) gives:

\[T = \frac{X_{\text{new}} - \bar{X}}{s\sqrt{1 + \frac{1}{n}}} \sim t_{n-1}\]
This follows a \(t\)-distribution (not a normal) because we replaced \(\sigma\) with \(s\), introducing the same extra uncertainty that Gosset identified. There are \(n - 1\) degrees of freedom because \(s\) is computed from \(n - 1\) independent deviations from \(\bar{x}\).
Step 6: Invert to get the interval

\[\bar{x} \pm t_{n-1,\,\alpha/2}\; s\sqrt{1 + \frac{1}{n}}\]
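The finished interval translates directly into code. A minimal sketch follows; the \(t\) critical value is passed in rather than computed (looking it up would need `scipy.stats.t.ppf` or a \(t\) table), and the battery numbers used in the demo match the worked example later on this page.

```python
import math

def prediction_interval(xbar, s, n, t_crit):
    """Interval for a single new observation: xbar +/- t * s * sqrt(1 + 1/n)."""
    half = t_crit * s * math.sqrt(1 + 1 / n)
    return xbar - half, xbar + half

def confidence_interval(xbar, s, n, t_crit):
    """Matching interval for the mean -- note the missing '+1' under the root."""
    half = t_crit * s / math.sqrt(n)
    return xbar - half, xbar + half

# n = 16 batteries, xbar = 48 h, s = 4 h, t_{15, 0.025} = 2.131
lo, hi = prediction_interval(48, 4, 16, 2.131)
print(round(lo, 1), round(hi, 1))  # 39.2 56.8
```

The only difference between the two functions is the `1 +` under the square root — the irreducible individual scatter.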
The "+1" — Why It Never Disappears¶
Look at the term under the square root: \(1 + \frac{1}{n}\).
As \(n \to \infty\), the \(\frac{1}{n}\) vanishes. The estimation uncertainty disappears — with infinite data, you know \(\mu\) perfectly. But the 1 stays. It represents \(\text{Var}(X_{\text{new}}) = \sigma^2\), the scatter of the individual around the mean. No amount of data eliminates this. Even an omniscient being who knows \(\mu\) exactly would still face \(\sigma^2\) of unpredictability for the next observation.
Compare the confidence interval: \(\bar{x} \pm t \cdot s / \sqrt{n}\). Here the term under the root is just \(\frac{1}{n}\), which goes to zero. The CI shrinks to a point as \(n\) grows. The prediction interval shrinks to \(\pm z \cdot \sigma\) (as \(n \to \infty\), \(t \to z\) and \(s \to \sigma\)) and stops there.
This is the fundamental difference:
| | Confidence Interval | Prediction Interval |
|---|---|---|
| Asks | Where is \(\mu\)? | Where will \(X_{\text{new}}\) be? |
| Variance | \(\frac{\sigma^2}{n}\) | \(\sigma^2 + \frac{\sigma^2}{n}\) |
| As \(n \to \infty\) | Shrinks to zero | Shrinks to \(\pm z \cdot \sigma\) |
| Irreducible? | No | Yes — the \(\sigma^2\) floor |
The Regression Case¶
In simple linear regression, predicting \(Y_{\text{new}}\) at a specific \(x_h\), three sources of uncertainty combine:

\[\text{Var}(Y_{\text{new}} - \hat{Y}_h) = \sigma^2 + \frac{\sigma^2}{n} + \frac{\sigma^2 (x_h - \bar{x})^2}{S_{xx}}\]
The first term is the irreducible floor. The second is uncertainty about the overall level. The third is slope uncertainty, amplified by how far \(x_h\) is from \(\bar{x}\) (the bowtie effect).
Giving the interval:

\[\hat{Y}_h \pm t_{n-2,\,\alpha/2}\; s\sqrt{1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}}\]
The degrees of freedom are \(n - 2\) here (not \(n - 1\)) because regression estimates two parameters (\(\beta_0\) and \(\beta_1\)), spending two degrees of freedom.
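The regression interval is a one-line extension of the mean case. A sketch, with the \(t\) critical value again supplied externally; the demo numbers are the exam-score summaries used in the worked example on this page.

```python
import math

def regression_pi(y_hat, s, n, x_h, x_bar, s_xx, t_crit):
    """Prediction interval for one new observation at x_h in simple linear
    regression. t_crit is t_{n-2, alpha/2}, looked up from a t table."""
    # 1 -> individual scatter; 1/n -> level uncertainty;
    # (x_h - x_bar)^2 / S_xx -> slope uncertainty (the bowtie effect)
    half = t_crit * s * math.sqrt(1 + 1 / n + (x_h - x_bar) ** 2 / s_xx)
    return y_hat - half, y_hat + half

# n = 5, xbar = 5, Sxx = 26, y_hat = 82.04 at x_h = 6, s = 2.13, t_{3,0.025} = 3.182
lo, hi = regression_pi(82.04, 2.13, 5, 6, 5, 26, 3.182)
print(round(lo, 1), round(hi, 1))  # 74.5 89.6
```

Moving `x_h` away from `x_bar` widens the interval — the bowtie effect in code.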
Variables Explained¶
| Symbol | Name | Description |
|---|---|---|
| \(X_{\text{new}}\) | New observation | The future value we're trying to predict |
| \(\bar{x}\) | Sample mean | Our best guess for the mean (or \(\hat{Y}_h\) in regression) |
| \(s\) | Sample standard deviation | Estimated spread of individual observations |
| \(n\) | Sample size | Number of observations in the training data |
| \(t_{df,\,\alpha/2}\) | Critical value | Multiplier from the \(t\)-distribution |
| \(\sigma^2\) | Population variance | The irreducible individual scatter |
Worked Examples¶
Example 1: Battery Life¶
You test \(n = 16\) batteries. \(\bar{x} = 48\) hours, \(s = 4\) hours. Predict where the next battery will fall (95%).
\(t_{15,\, 0.025} = 2.131\)
Confidence interval (for the mean):

\[48 \pm 2.131 \cdot \frac{4}{\sqrt{16}} = 48 \pm 2.13\]

CI: (45.9, 50.1) — where the average battery life is.
Prediction interval (for the next battery):

\[48 \pm 2.131 \cdot 4\sqrt{1 + \frac{1}{16}} = 48 \pm 8.79\]

PI: (39.2, 56.8) — where the next individual battery will probably land.
The PI is over 4 times wider. Even though we have a decent estimate of the mean, individual batteries vary a lot.
Example 2: Effect of Sample Size¶
Same \(s = 4\), varying \(n\):
| \(n\) | CI half-width | PI half-width | PI / CI ratio |
|---|---|---|---|
| 4 | 6.36 | 14.23 | 2.2x |
| 16 | 2.13 | 8.79 | 4.1x |
| 100 | 0.79 | 7.98 | 10.0x |
| 10000 | 0.08 | 7.84 | 100x |
The CI shrinks relentlessly. The PI barely budges — it's dominated by the \(\sigma^2\) floor. At \(n = 10{,}000\) the CI is razor-thin but the PI is still about \(\pm 8\) hours. You know the mean with extreme precision, but individual batteries still scatter.
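The table can be recomputed directly. A sketch, assuming the two-sided 95% critical values \(t_{n-1,\,0.025}\) below (taken from a \(t\) table, not computed); small rounding differences from the table above are possible.

```python
import math

s = 4.0
# Two-sided 95% t critical values, t_{n-1, 0.025}, from a t table
t_crit = {4: 3.182, 16: 2.131, 100: 1.984, 10_000: 1.960}

for n, t in t_crit.items():
    ci = t * s / math.sqrt(n)               # shrinks like 1/sqrt(n)
    pi = t * s * math.sqrt(1 + 1 / n)       # bottoms out at t * s
    print(f"n={n:>6}  CI={ci:6.2f}  PI={pi:6.2f}  ratio={pi / ci:6.1f}x")
```

Note that the ratio PI/CI equals \(\sqrt{n+1}\) exactly, whatever the critical value — the \(t\) multiplier cancels.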
Example 3: Regression — Exam Scores¶
From the SLR pages: \(n = 5\), \(\bar{x} = 5\), \(S_{xx} = 26\), \(\hat{Y}_h = 82.04\) at \(x_h = 6\) hours, \(s = 2.13\).
\(t_{3,\, 0.025} = 3.182\)
CI (mean score for students studying 6 hours):

\[82.04 \pm 3.182 \cdot 2.13\sqrt{\frac{1}{5} + \frac{(6-5)^2}{26}} = 82.04 \pm 3.31\]

CI: (78.7, 85.4)
PI (score for one specific student studying 6 hours):

\[82.04 \pm 3.182 \cdot 2.13\sqrt{1 + \frac{1}{5} + \frac{(6-5)^2}{26}} = 82.04 \pm 7.54\]

PI: (74.5, 89.6)
The PI is more than twice as wide — because one student's score has inherent randomness beyond what the regression line can predict.
Common Mistakes¶
- Using a CI when you need a PI: If someone asks "what will this specific patient's blood pressure be?", you need a PI. If they ask "what is the average blood pressure for people on this medication?", you need a CI. Most real-world decisions are about individuals, not averages — so PIs are often what you actually want.
- Expecting the PI to shrink to zero: It won't. The \(\sigma^2\) floor is a hard limit. If your model's residual standard deviation is 10, your 95% PI will never be narrower than about \(\pm 20\), no matter how much data you collect.
- Forgetting the PI gets wider in regression: At extreme \(x_h\) values (far from \(\bar{x}\)), the \((x_h - \bar{x})^2/S_{xx}\) term blows up. Extrapolation makes predictions even less reliable.
- Reporting CIs and calling them "predictions": Many papers report CI bands around regression lines and claim they show prediction uncertainty. They don't — they show uncertainty about the mean. Real prediction bands are much wider.
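The first mistake above is easy to demonstrate by simulation. A sketch, with illustrative parameters (standard normal population, \(n = 16\), \(t_{15,\,0.025} = 2.131\)): the CI for the mean captures the next observation far less often than its nominal 95%, while the PI hits its target.

```python
import math
import random

random.seed(1)
mu, sigma, n, t, reps = 0.0, 1.0, 16, 2.131, 20_000

ci_hits = pi_hits = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    x_new = random.gauss(mu, sigma)  # the observation we try to cover
    ci_hits += abs(x_new - xbar) <= t * s / math.sqrt(n)
    pi_hits += abs(x_new - xbar) <= t * s * math.sqrt(1 + 1 / n)

print(f"CI covers x_new: {ci_hits / reps:.1%}")  # far below 95%
print(f"PI covers x_new: {pi_hits / reps:.1%}")  # close to 95%
```

Using a CI as if it were a PI doesn't just underestimate uncertainty slightly — it misses the next observation more than half the time at this sample size.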
Related Formulas¶
- Confidence Interval — for means, not individuals
- Standard Error — the \(s/\sqrt{n}\) component of both intervals
- SLR: Mean Response and Prediction — both intervals in the regression context
- Gaussian Distribution — the assumed shape of the individual scatter
History¶
- 1908 — Gosset's \(t\)-distribution makes small-sample intervals possible — both confidence and prediction.
- 1937 — Neyman formalizes confidence intervals, carefully distinguishing between estimating parameters and predicting observables. The prediction interval is the latter.
- 1960s–70s — As regression becomes the workhorse of applied science, the distinction between confidence and prediction bands gains practical importance. Textbooks begin emphasizing the "+1" — but many practitioners still confuse the two.
- Today — Machine learning has revived interest in prediction intervals under the name "prediction uncertainty" or "calibrated uncertainty." Modern methods (conformal prediction, Bayesian prediction) extend the idea beyond normality assumptions.
References¶
- Kutner, M. H., et al. (2004). Applied Linear Statistical Models, 5th ed.
- Hahn, G. J. & Meeker, W. Q. (1991). Statistical Intervals: A Guide for Practitioners. Wiley.
- Morey, R. D., et al. (2016). "The fallacy of placing confidence in confidence intervals." Psychonomic Bulletin & Review.