SLR: Mean Response and Prediction

The Formulas

Estimated Mean Response

\[ \hat{Y}_h = \hat{\beta}_0 + \hat{\beta}_1 x_h \]

Variance of the Mean Response

\[ \text{Var}(\hat{Y}_h) = \sigma^2 \left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}\right) \]

Confidence Interval for the Mean Response \(E(Y | X = x_h)\)

\[ \hat{Y}_h \pm t_{n-2, \alpha/2} \cdot s \sqrt{\frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}} \]

Prediction Interval for a New Observation \(Y_{\text{new}}\) at \(x_h\)

\[ \hat{Y}_h \pm t_{n-2, \alpha/2} \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}} \]
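All four formulas use the same few quantities, so they translate directly into code. A minimal Python sketch (the function and variable names here are mine; only numpy and scipy are assumed):

```python
import numpy as np
from scipy import stats

def slr_intervals(x, y, x_h, alpha=0.05):
    """Fit simple linear regression, then return the point estimate,
    the confidence interval for the mean response at x_h, and the
    prediction interval for a new observation at x_h."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - ybar)) / Sxx        # slope estimate
    b0 = ybar - b1 * xbar                             # intercept estimate
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))         # residual std. error
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    y_hat = b0 + b1 * x_h
    se_mean = s * np.sqrt(1 / n + (x_h - xbar) ** 2 / Sxx)
    se_pred = s * np.sqrt(1 + 1 / n + (x_h - xbar) ** 2 / Sxx)
    return {
        "fit": y_hat,
        "ci": (y_hat - t * se_mean, y_hat + t * se_mean),
        "pi": (y_hat - t * se_pred, y_hat + t * se_pred),
    }
```

Because `se_pred > se_mean` at every `x_h`, the prediction interval always contains the confidence interval.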

What They Mean

There are two very different questions you can ask a regression model, and confusing them is one of the most common mistakes in statistics:

  1. "What's the average \(y\) for this value of \(x\)?" — This asks about the mean response \(E(Y | X = x_h)\). You're estimating the center of the distribution at \(x_h\). The confidence interval answers this.

  2. "What will the next observation be at this \(x\)?" — This asks about a single new data point. Even if you knew the mean perfectly, individual observations scatter around it. The prediction interval answers this.

The prediction interval is always wider — it includes both the uncertainty about the mean and the random scatter of individuals around that mean.

Why It Works — The Intuition

The Variance of \(\hat{Y}_h\) — Why the Bowtie Shape?

If you've ever seen a regression plot with confidence bands, you've noticed they form a bowtie (or hourglass) shape — narrowest at \(\bar{x}\) and widening as you move away. The variance formula explains why.

\[ \text{Var}(\hat{Y}_h) = \sigma^2 \left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}\right) \]

At \(x_h = \bar{x}\): The second term vanishes, leaving \(\text{Var}(\hat{Y}_h) = \sigma^2/n\). You're estimating the mean response at the centroid, where you have maximum information — the only uncertainty left is the sampling variability of \(\bar{y}\) itself.

As \(x_h\) moves away from \(\bar{x}\): The \((x_h - \bar{x})^2\) term grows. You're extrapolating along the regression line, and slope uncertainty gets amplified by the distance from the center. It's the see-saw effect again — the line pivots on the centroid, so points far from the center swing more.

This is exactly why extrapolation is dangerous. At \(x_h\) far from your data, the confidence band explodes. The model isn't necessarily wrong, but it's honest about how little it knows out there.
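A quick numeric illustration of that explosion, using made-up design values (this \(n\), \(\bar{x}\), \(S_{xx}\), and \(s\) are arbitrary, chosen only to show the pattern):

```python
import numpy as np

# Hypothetical design: n, xbar, Sxx, and s are invented for illustration.
n, xbar, Sxx, s = 20, 5.0, 40.0, 1.0

for x_h in [5.0, 7.0, 10.0, 20.0]:
    # Standard error of the estimated mean response at x_h
    se = s * np.sqrt(1 / n + (x_h - xbar) ** 2 / Sxx)
    print(f"x_h = {x_h:4.1f}   |x_h - xbar| = {abs(x_h - xbar):4.1f}   SE = {se:.3f}")
```

With these numbers, the standard error at \(x_h = 20\) comes out more than ten times its value at the centroid — same model, same data, vastly less certainty.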

Full Derivation of \(\text{Var}(\hat{Y}_h)\)

\[ \hat{Y}_h = \hat{\beta}_0 + \hat{\beta}_1 x_h = (\bar{y} - \hat{\beta}_1 \bar{x}) + \hat{\beta}_1 x_h = \bar{y} + \hat{\beta}_1(x_h - \bar{x}) \]

This rewrite is beautiful. The predicted value is the sample mean \(\bar{y}\) plus a slope correction based on how far \(x_h\) is from \(\bar{x}\).

Now, \(\bar{y}\) and \(\hat{\beta}_1\) are both functions of the \(y_i\)'s. Are they independent? Yes — and this is a remarkable fact. \(\bar{y}\) captures the "level" of the data while \(\hat{\beta}_1\) captures the "tilt"; they are uncorrelated under any error distribution, and for normally distributed errors uncorrelated means fully independent.

Since they're independent:

\[ \text{Var}(\hat{Y}_h) = \text{Var}(\bar{y}) + (x_h - \bar{x})^2 \text{Var}(\hat{\beta}_1) \]
\[ = \frac{\sigma^2}{n} + (x_h - \bar{x})^2 \cdot \frac{\sigma^2}{S_{xx}} \]
\[ \boxed{= \sigma^2\left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}\right)} \]

Two sources of uncertainty, added together:

  • \(\sigma^2/n\): uncertainty in the overall level (the mean)
  • \(\sigma^2(x_h - \bar{x})^2/S_{xx}\): uncertainty from the slope, amplified by distance from center
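The decomposition can be checked by simulation: generate many datasets from a known model, refit the line each time, and compare the empirical variance of \(\hat{Y}_h\) with the formula. Everything in this sketch (the design, \(\beta_0\), \(\beta_1\), \(\sigma\)) is an assumption for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed model for the demo: fixed design with known parameters.
x = np.linspace(0, 10, 20)
beta0, beta1, sigma = 2.0, 1.5, 1.0
x_h = 9.0
n, xbar = len(x), x.mean()
Sxx = np.sum((x - xbar) ** 2)

# Simulate 20,000 datasets and refit the regression line each time.
Y = beta0 + beta1 * x + rng.normal(0, sigma, size=(20000, n))
b1 = (Y - Y.mean(axis=1, keepdims=True)) @ (x - xbar) / Sxx
b0 = Y.mean(axis=1) - b1 * xbar
fits = b0 + b1 * x_h                      # 20,000 draws of Y_hat at x_h

theory = sigma**2 * (1 / n + (x_h - xbar) ** 2 / Sxx)
print(np.var(fits), theory)               # the two should agree closely
```

The empirical variance of the fitted values lands within Monte Carlo error of \(\sigma^2(1/n + (x_h-\bar{x})^2/S_{xx})\).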

Confidence Interval vs. Prediction Interval — The Extra "1"

Now suppose you want to predict a new individual observation \(Y_{\text{new}} = \beta_0 + \beta_1 x_h + \varepsilon_{\text{new}}\) at the point \(x_h\).

The prediction error is:

\[ Y_{\text{new}} - \hat{Y}_h = (\beta_0 + \beta_1 x_h + \varepsilon_{\text{new}}) - \hat{Y}_h \]

The variance of this error is the sum of two independent pieces — the noise in the new observation and the uncertainty in the fitted value:

\[ \text{Var}(Y_{\text{new}} - \hat{Y}_h) = \underbrace{\text{Var}(\varepsilon_{\text{new}})}_{\sigma^2} + \underbrace{\text{Var}(\hat{Y}_h)}_{\sigma^2\left(\frac{1}{n} + \frac{(x_h-\bar{x})^2}{S_{xx}}\right)} \]

The cross term vanishes because \(\varepsilon_{\text{new}}\) is independent of the training data (it's a new observation). So:

\[ \text{Var}(Y_{\text{new}} - \hat{Y}_h) = \sigma^2\left(1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{S_{xx}}\right) \]

That leading "1" is the irreducible noise — even with infinite data and a perfect regression line, individual observations still scatter around the mean with variance \(\sigma^2\). The confidence interval shrinks to zero as \(n \to \infty\). The prediction interval never gets narrower than \(\pm t \cdot \sigma\) — there's a floor set by the inherent randomness of the world.

This is a profound distinction:

  • Confidence interval: "Where is the true average?" → Gets arbitrarily precise with more data
  • Prediction interval: "Where will the next point land?" → Has an irreducible minimum width
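A quick way to see the floor numerically: compute both half-widths at the centroid for growing \(n\), plugging in the true \(\sigma\) in place of \(s\) to isolate the effect (the \(\sigma\) here is an arbitrary illustrative value):

```python
import numpy as np
from scipy import stats

sigma, alpha = 2.0, 0.05   # assumed noise level for the demo

for n in [5, 50, 500, 50000]:
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    # Half-widths at the centroid (x_h = xbar), where the slope term is zero.
    ci = t * sigma * np.sqrt(1 / n)        # shrinks toward 0 as n grows
    pi = t * sigma * np.sqrt(1 + 1 / n)    # floors near 1.96 * sigma
    print(f"n = {n:6d}   CI half-width = {ci:6.3f}   PI half-width = {pi:.3f}")
```

By \(n = 50{,}000\) the confidence half-width is essentially zero, while the prediction half-width has stalled at roughly \(1.96\sigma\).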

The Bowtie Gets Wider

Both intervals widen as \(x_h\) moves from \(\bar{x}\), but the prediction interval starts wider (because of the +1) and stays wider everywhere; far from the center the slope term dominates both, so the two bands widen roughly in parallel. Visually:

            Prediction bands (wider)
          /                           \
        /    Confidence bands           \
      /    /                     \        \
    /    /                         \        \
---/---/---------- x̄ ----------------\--------\---
    \    \                         /        /
      \    \                     /        /
        \    Confidence bands           /
          \                           /
            Prediction bands

Worked Example

Study hours data: \(n = 5\), \(\bar{x} = 5\), \(S_{xx} = 26\), \(\hat{\beta}_0 = 57.8\), \(\hat{\beta}_1 = 4.04\), \(s^2 = 4.53\) (estimated), so \(s = 2.13\).

Predict the mean score at \(x_h = 6\) hours

\[ \hat{Y}_h = 57.8 + 4.04(6) = 82.04 \]

Confidence interval (95%, \(t_{3, 0.025} = 3.182\)):

\[ s\sqrt{\frac{1}{5} + \frac{(6-5)^2}{26}} = 2.13\sqrt{0.2 + 0.038} = 2.13 \times 0.488 = 1.04 \]
\[ 82.04 \pm 3.182 \times 1.04 = 82.04 \pm 3.31 = (78.7, \; 85.4) \]

We're 95% confident the average score for students who study 6 hours is between 78.7 and 85.4.

Prediction interval (same setup):

\[ s\sqrt{1 + \frac{1}{5} + \frac{(6-5)^2}{26}} = 2.13\sqrt{1.238} = 2.13 \times 1.113 = 2.37 \]
\[ 82.04 \pm 3.182 \times 2.37 = 82.04 \pm 7.54 = (74.5, \; 89.6) \]

An individual student who studies 6 hours would probably score between 74.5 and 89.6 — a much wider range, because people vary.

Compare at \(x_h = 2\) (near the edge)

\[ \hat{Y}_h = 57.8 + 4.04(2) = 65.88 \]

Confidence SE: \(s\sqrt{\frac{1}{5} + \frac{(2-5)^2}{26}} = 2.13\sqrt{0.2 + 0.346} = 2.13 \times 0.739 = 1.57\)

CI: \(65.88 \pm 3.182 \times 1.57 = 65.88 \pm 5.00 = (60.9, \; 70.9)\)

Wider than at \(x_h = 6\), because we're farther from \(\bar{x} = 5\). The bowtie opens up.
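The hand calculations above can be reproduced from the summary statistics alone (all numbers below are taken from the worked example; the helper function name is mine):

```python
import numpy as np
from scipy import stats

# Summary statistics from the worked example.
n, xbar, Sxx = 5, 5.0, 26.0
b0, b1, s = 57.8, 4.04, np.sqrt(4.53)
t = stats.t.ppf(0.975, df=n - 2)           # t_{3, 0.025}, about 3.182

def interval(x_h, predict=False):
    y_hat = b0 + b1 * x_h
    extra = 1.0 if predict else 0.0        # the "+1" for a new observation
    se = s * np.sqrt(extra + 1 / n + (x_h - xbar) ** 2 / Sxx)
    return y_hat - t * se, y_hat + t * se

print(interval(6))                 # CI for the mean score at 6 hours
print(interval(6, predict=True))   # PI for one student at 6 hours
print(interval(2))                 # CI at 2 hours: wider than at 6
```

The printed intervals match the hand-rounded values above to within rounding of the intermediate steps.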

Common Mistakes

  • Using a confidence interval when you need a prediction interval: If someone asks "what score will this student get?", you need a prediction interval. If they ask "what's the average score for students who study 6 hours?", you need a confidence interval. The first is about an individual; the second is about a population average.
  • Extrapolating confidently: The bands widen for a reason. Predicting at \(x_h = 20\) hours with data from 2–8 hours is reckless. The formula will give you a number, but the model has no idea if the linear relationship holds that far out.
  • Forgetting the prediction interval has a floor: No amount of data makes the prediction interval vanish. There's always \(\sigma^2\) of irreducible uncertainty. If \(\sigma\) is large, your model might be "correct" but still not very useful for individual predictions.
  • Thinking narrow confidence bands mean good predictions: You can have razor-thin confidence bands (you know the mean precisely) and still have wide prediction bands (individuals scatter a lot). These are separate questions.

History

  • 1805–1809 — Legendre and Gauss develop least squares, but the focus is on point estimates — how to find the "best" line, not how uncertain it is
  • 1821–1823 — Gauss derives the variance formulas for the estimators, giving us the tools to quantify uncertainty in regression
  • 1908 — Gosset's t-distribution makes small-sample inference possible — crucial because early regression applications often had tiny datasets
  • 1929 — Working and Hotelling derive the confidence bands for the entire regression line — the mathematical basis for the bowtie shape. They showed that a simultaneous confidence band covering the whole line requires a wider multiplier, based on the \(F\)-distribution, than pointwise \(t\)-intervals

References

  • Kutner, M. H., et al. (2004). Applied Linear Statistical Models, 5th ed.
  • Weisberg, S. (2014). Applied Linear Regression, 4th ed.
  • Working, H. & Hotelling, H. (1929). "Applications of the Theory of Error to the Interpretation of Trends." J. Amer. Statist. Assoc.