Prediction Interval

The Formula

For predicting a single new observation from a normal population:

\[ \bar{x} \pm t_{n-1,\, \alpha/2} \cdot s\sqrt{1 + \frac{1}{n}} \]

For predicting a new observation at \(x_h\) in simple linear regression:

\[ \hat{Y}_h \pm t_{n-2,\, \alpha/2} \cdot s\sqrt{1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}} \]

What It Means

A confidence interval answers: "Where is the true average?"

A prediction interval answers a different question: "Where will the next individual observation land?"

Even if you knew the true mean exactly — no uncertainty at all — individual observations would still scatter around it. People aren't all the same height. Batteries don't all last the same number of hours. The prediction interval captures both your uncertainty about the mean and this irreducible individual variation.

That's why a prediction interval is always wider than a confidence interval. Always. No exceptions. And unlike a confidence interval, it never shrinks to zero no matter how much data you collect — because individual randomness doesn't go away.
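That coverage claim can be checked by simulation: build a 95% prediction interval from a sample, then see how often a fresh draw from the same population lands inside. A minimal stdlib-only sketch (the population parameters \(\mu = 48\), \(\sigma = 4\) mirror the battery example below, and \(t_{15,\,0.025} = 2.131\) is taken from a standard \(t\)-table):

```python
import random
import statistics

def pi_covers_next(mu, sigma, n, t_crit, rng):
    """Draw a sample, form the 95% prediction interval, then check
    whether a fresh observation from the same population falls inside."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)
    half = t_crit * s * (1 + 1 / n) ** 0.5
    x_new = rng.gauss(mu, sigma)  # the "next" observation
    return xbar - half <= x_new <= xbar + half

rng = random.Random(42)
trials = 20_000
hits = sum(pi_covers_next(48, 4, 16, 2.131, rng) for _ in range(trials))
print(f"empirical coverage: {hits / trials:.3f}")  # close to 0.95
```

Run it and the empirical coverage sits near 95%, even though each interval is built from only 16 observations.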

Why It Works — The Story Behind the Formula

The Key Insight: Two Sources of Uncertainty

Suppose you've collected \(n\) observations and now want to predict a new one, \(X_{\text{new}}\). Your best guess is \(\bar{x}\), the sample mean. The prediction error is:

\[ X_{\text{new}} - \bar{X} \]

What's the variance of this error? Here's where it gets interesting. \(X_{\text{new}}\) and \(\bar{X}\) are independent: the new observation is a fresh draw from the population, sampled independently of the data you already have. This independence is crucial, because it means:

\[ \text{Var}(X_{\text{new}} - \bar{X}) = \text{Var}(X_{\text{new}}) + \text{Var}(\bar{X}) \]

No cross-term. No covariance to worry about. The variances just add.

If the variables were not independent, we'd need \(\text{Var}(A - B) = \text{Var}(A) + \text{Var}(B) - 2\text{Cov}(A, B)\). But \(\text{Cov}(X_{\text{new}}, \bar{X}) = 0\) because \(X_{\text{new}}\) is a fresh draw, independent of the training data. So the covariance term vanishes.
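This additivity is easy to verify numerically. A stdlib-only sketch; the values of \(\mu\), \(\sigma\), and \(n\) are arbitrary illustrative choices, not from the text:

```python
import random
import statistics

# Simulate the prediction error X_new - X̄ many times and compare its
# empirical variance to the theoretical value sigma^2 * (1 + 1/n).
mu, sigma, n = 10.0, 3.0, 8
rng = random.Random(0)

errors = []
for _ in range(50_000):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_new = rng.gauss(mu, sigma)  # independent fresh draw
    errors.append(x_new - statistics.mean(sample))

theory = sigma**2 * (1 + 1 / n)  # 9 * 1.125 = 10.125
print(statistics.variance(errors), theory)
```

The two numbers agree to within simulation noise, confirming that the variances simply add.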

Full Derivation — The Mean Case

Step 1: Set up the prediction error

We want to predict \(X_{\text{new}} \sim N(\mu, \sigma^2)\) using our sample mean \(\bar{X}\). The error is:

\[ X_{\text{new}} - \bar{X} \]

Step 2: Find the expected value of the error

\[ E(X_{\text{new}} - \bar{X}) = E(X_{\text{new}}) - E(\bar{X}) \]

We can split the expectation because \(E(A - B) = E(A) - E(B)\) always (linearity of expectation — no assumptions needed).

\[ = \mu - \mu = 0 \]

Good — our prediction is unbiased. On average, we're dead on.

Step 3: Find the variance of the error

\[ \text{Var}(X_{\text{new}} - \bar{X}) = \text{Var}(X_{\text{new}}) + \text{Var}(\bar{X}) \]

The variances add (not subtract) because \(\text{Var}(A - B) = \text{Var}(A) + \text{Var}(B)\) when \(A\) and \(B\) are independent. The minus sign in \(A - B\) becomes a plus in the variance — variance doesn't care about direction, only magnitude. Formally: \(\text{Var}(-B) = (-1)^2 \text{Var}(B) = \text{Var}(B)\).

Now plug in:

\[ \text{Var}(X_{\text{new}}) = \sigma^2 \]

This is just the variance of a single observation from the population — it's what \(\sigma^2\) means.

\[ \text{Var}(\bar{X}) = \frac{\sigma^2}{n} \]

This is the standard error squared. We derived this elsewhere: \(\text{Var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}\).

So:

\[ \text{Var}(X_{\text{new}} - \bar{X}) = \sigma^2 + \frac{\sigma^2}{n} = \sigma^2\left(1 + \frac{1}{n}\right) \]

Step 4: The standard deviation of the error

\[ \text{SD}(X_{\text{new}} - \bar{X}) = \sigma\sqrt{1 + \frac{1}{n}} \]

Step 5: Standardize

Since \(X_{\text{new}} - \bar{X}\) is normal (a difference of independent normals is itself normal), standardizing and replacing \(\sigma\) with \(s\) gives:

\[ T = \frac{X_{\text{new}} - \bar{X}}{s\sqrt{1 + \frac{1}{n}}} \sim t_{n-1} \]

This follows a \(t\)-distribution (not normal) because we replaced \(\sigma\) with \(s\), introducing the same extra uncertainty that Gosset identified. The degrees of freedom are \(n - 1\) because \(s\) is computed from deviations about \(\bar{X}\), which satisfy one linear constraint (they sum to zero), leaving \(n - 1\) independent pieces of information.

Step 6: Invert to get the interval

\[ \boxed{\bar{X} \pm t_{n-1,\,\alpha/2} \cdot s\sqrt{1 + \frac{1}{n}}} \]
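The boxed formula translates directly into code. A minimal sketch: the critical value is passed in from a \(t\)-table rather than computed, to stay dependency-free. The numbers shown are the battery example worked out below.

```python
def prediction_interval(xbar, s, n, t_crit):
    """Prediction interval for the next single observation.
    t_crit is the two-sided critical value t_{n-1, alpha/2},
    looked up from a t-table (not computed here)."""
    half = t_crit * s * (1 + 1 / n) ** 0.5
    return xbar - half, xbar + half

# n = 16 batteries, x̄ = 48 hours, s = 4 hours, t_{15, 0.025} = 2.131
lo, hi = prediction_interval(48, 4, 16, 2.131)
print(f"({lo:.1f}, {hi:.1f})")  # (39.2, 56.8)
```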

The "+1" — Why It Never Disappears

Look at the term under the square root: \(1 + \frac{1}{n}\).

As \(n \to \infty\), the \(\frac{1}{n}\) vanishes. The estimation uncertainty disappears — with infinite data, you know \(\mu\) perfectly. But the 1 stays. It represents \(\text{Var}(X_{\text{new}}) = \sigma^2\), the scatter of the individual around the mean. No amount of data eliminates this. Even an omniscient being who knows \(\mu\) exactly would still face \(\sigma^2\) of unpredictability for the next observation.

Compare the confidence interval: \(\bar{x} \pm t \cdot s / \sqrt{n}\). Here the term under the root is just \(\frac{1}{n}\), which goes to zero. The CI shrinks to a point as \(n\) grows. The prediction interval shrinks to \(\pm z \cdot \sigma\) (as \(n \to \infty\), the \(t\) critical value converges to the normal \(z\)) and then stops.

This is the fundamental difference:

| | Confidence Interval | Prediction Interval |
|---|---|---|
| Asks | Where is \(\mu\)? | Where will \(X_{\text{new}}\) be? |
| Variance | \(\frac{\sigma^2}{n}\) | \(\sigma^2 + \frac{\sigma^2}{n}\) |
| As \(n \to \infty\) | Shrinks to zero | Shrinks to \(\pm z \cdot \sigma\) |
| Irreducible? | No | Yes — the \(\sigma^2\) floor |

The Regression Case

In simple linear regression, predicting \(Y_{\text{new}}\) at a specific \(x_h\), three sources of uncertainty combine:

\[ \text{Var}(Y_{\text{new}} - \hat{Y}_h) = \underbrace{\sigma^2}_{\text{individual scatter}} + \underbrace{\sigma^2 \cdot \frac{1}{n}}_{\text{uncertainty in } \bar{y}} + \underbrace{\sigma^2 \cdot \frac{(x_h - \bar{x})^2}{S_{xx}}}_{\text{uncertainty in slope}} \]

The first term is the irreducible floor. The second is uncertainty about the overall level. The third is slope uncertainty, amplified by how far \(x_h\) is from \(\bar{x}\) (the bowtie effect).

\[ = \sigma^2\left(1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}\right) \]

Giving the interval:

\[ \hat{Y}_h \pm t_{n-2,\,\alpha/2} \cdot s\sqrt{1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}} \]

The degrees of freedom are \(n - 2\) here (not \(n - 1\)) because regression estimates two parameters (\(\beta_0\) and \(\beta_1\)), spending two degrees of freedom.
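The regression interval is equally direct to compute. A minimal sketch, with the critical value \(t_{n-2,\,\alpha/2}\) supplied from a \(t\)-table; the numbers are the exam-score example worked out below.

```python
def regression_pi(y_hat, s, n, x_h, xbar, sxx, t_crit):
    """Prediction interval for a new Y at x_h in simple linear regression.
    t_crit is the two-sided critical value t_{n-2, alpha/2}."""
    half = t_crit * s * (1 + 1 / n + (x_h - xbar) ** 2 / sxx) ** 0.5
    return y_hat - half, y_hat + half

# n = 5, x̄ = 5, Sxx = 26, ŷ = 82.04 at x_h = 6, s = 2.13, t_{3, 0.025} = 3.182
lo, hi = regression_pi(82.04, 2.13, 5, 6, 5, 26, 3.182)
print(f"({lo:.1f}, {hi:.1f})")  # (74.5, 89.6)
```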

Variables Explained

| Symbol | Name | Description |
|---|---|---|
| \(X_{\text{new}}\) | New observation | The future value we're trying to predict |
| \(\bar{x}\) | Sample mean | Our best guess for the mean (or \(\hat{Y}_h\) in regression) |
| \(s\) | Sample standard deviation | Estimated spread of individual observations |
| \(n\) | Sample size | Number of observations in the training data |
| \(t_{df,\,\alpha/2}\) | Critical value | Multiplier from the \(t\)-distribution |
| \(\sigma^2\) | Population variance | The irreducible individual scatter |

Worked Examples

Example 1: Battery Life

You test \(n = 16\) batteries. \(\bar{x} = 48\) hours, \(s = 4\) hours. Predict where the next battery will fall (95%).

\(t_{15,\, 0.025} = 2.131\)

Confidence interval (for the mean):

\[ 48 \pm 2.131 \times \frac{4}{\sqrt{16}} = 48 \pm 2.131 \times 1 = 48 \pm 2.13 \]

CI: (45.9, 50.1) — where the average battery life is.

Prediction interval (for the next battery):

\[ 48 \pm 2.131 \times 4\sqrt{1 + \frac{1}{16}} = 48 \pm 2.131 \times 4 \times 1.031 = 48 \pm 8.79 \]

PI: (39.2, 56.8) — where the next individual battery will probably land.

The PI is over 4 times wider. Even though we have a decent estimate of the mean, individual batteries vary a lot.

Example 2: Effect of Sample Size

Same \(s = 4\), varying \(n\):

| \(n\) | CI half-width | PI half-width | PI / CI ratio |
|---|---|---|---|
| 4 | 6.36 | 14.23 | 2.2x |
| 16 | 2.13 | 8.79 | 4.1x |
| 100 | 0.79 | 7.98 | 10.0x |
| 10000 | 0.08 | 7.84 | 100x |

The CI shrinks relentlessly. The PI barely budges — it's dominated by the \(\sigma^2\) floor. At \(n = 10{,}000\) the CI is razor-thin but the PI is still about \(\pm 8\) hours. You know the mean with extreme precision, but individual batteries still scatter.
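The calculation behind these widths can be scripted in a few lines. The 95% critical values \(t_{n-1,\,0.025}\) are hard-coded from a standard \(t\)-table, since Python's standard library has no \(t\) quantile function:

```python
import math

# Half-widths of the 95% CI and PI for s = 4 at several sample sizes.
# Keys are n; values are t_{n-1, 0.025} from a t-table.
t_crit = {4: 3.182, 16: 2.131, 100: 1.984, 10_000: 1.960}
s = 4

for n, t in t_crit.items():
    ci = t * s / math.sqrt(n)            # shrinks like 1/sqrt(n)
    pi = t * s * math.sqrt(1 + 1 / n)    # floored by t * s
    print(f"n={n:>5}  CI ±{ci:5.2f}  PI ±{pi:5.2f}  ratio {pi / ci:5.1f}x")
```

Note that the ratio simplifies to \(\sqrt{n + 1}\) regardless of the critical value, which is why it grows without bound.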

Example 3: Regression — Exam Scores

From the SLR pages: \(n = 5\), \(\bar{x} = 5\), \(S_{xx} = 26\), \(\hat{Y}_h = 82.04\) at \(x_h = 6\) hours, \(s = 2.13\).

\(t_{3,\, 0.025} = 3.182\)

CI (mean score for students studying 6 hours):

\[ 82.04 \pm 3.182 \times 2.13\sqrt{\frac{1}{5} + \frac{(6 - 5)^2}{26}} = 82.04 \pm 3.31 \]

CI: (78.7, 85.4)

PI (score for one specific student studying 6 hours):

\[ 82.04 \pm 3.182 \times 2.13\sqrt{1 + \frac{1}{5} + \frac{(6 - 5)^2}{26}} = 82.04 \pm 7.54 \]

PI: (74.5, 89.6)

The PI is more than twice as wide — because one student's score has inherent randomness beyond what the regression line can predict.

Common Mistakes

  • Using a CI when you need a PI: If someone asks "what will this specific patient's blood pressure be?", you need a PI. If they ask "what is the average blood pressure for people on this medication?", you need a CI. Most real-world decisions are about individuals, not averages — so PIs are often what you actually want.
  • Expecting the PI to shrink to zero: It won't. The \(\sigma^2\) floor is a hard limit. If your model's residual standard deviation is 10, your 95% PI will never be narrower than about \(\pm 20\), no matter how much data you collect.
  • Forgetting the PI gets wider in regression: At extreme \(x_h\) values (far from \(\bar{x}\)), the \((x_h - \bar{x})^2/S_{xx}\) term blows up. Extrapolation makes predictions even less reliable.
  • Reporting CIs and calling them "predictions": Many papers report CI bands around regression lines and claim they show prediction uncertainty. They don't — they show uncertainty about the mean. Real prediction bands are much wider.
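The extrapolation point is easy to see numerically, reusing the exam-score fit from Example 3 (\(n = 5\), \(\bar{x} = 5\), \(S_{xx} = 26\), \(s = 2.13\), \(t_{3,\,0.025} = 3.182\)):

```python
import math

# PI half-width at increasingly extreme x_h values: the (x_h - x̄)²/Sxx
# term dominates once we leave the range of the data.
n, xbar, sxx, s, t = 5, 5, 26, 2.13, 3.182

for x_h in (5, 6, 10, 20):
    half = t * s * math.sqrt(1 + 1 / n + (x_h - xbar) ** 2 / sxx)
    print(f"x_h = {x_h:>2}: PI half-width = {half:.2f}")
```

At \(x_h = \bar{x}\) the half-width is at its minimum; by \(x_h = 20\) it has roughly tripled, and the interval is so wide it carries little practical information.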

History

  • 1908 — Gosset's \(t\)-distribution makes small-sample intervals possible — both confidence and prediction.
  • 1937 — Neyman formalizes confidence intervals, carefully distinguishing between estimating parameters and predicting observables. The prediction interval is the latter.
  • 1960s–70s — As regression becomes the workhorse of applied science, the distinction between confidence and prediction bands gains practical importance. Textbooks begin emphasizing the "+1" — but many practitioners still confuse the two.
  • Today — Machine learning has revived interest in prediction intervals under the name "prediction uncertainty" or "calibrated uncertainty." Modern methods (conformal prediction, Bayesian prediction) extend the idea beyond normality assumptions.
