SLR: Properties of the Slope Estimator

The Formulas

\[ E(\hat{\beta}_1) = \beta_1 \]
\[ \text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}} = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \]

What They Mean

The first equation says: \(\hat{\beta}_1\) is unbiased. If you could repeat your experiment infinitely many times — collecting new data each time and computing a new slope — the average of all those slopes would be exactly the true slope \(\beta_1\). Your estimator doesn't systematically overshoot or undershoot. On average, it's dead on.

The second equation tells you how much \(\hat{\beta}_1\) would bounce around across all those hypothetical repetitions. Every time you collect new data, you'd get a slightly different slope. The variance formula tells you how spread out those slopes would be.

Two things make \(\hat{\beta}_1\) more precise: smaller noise (\(\sigma^2\)) and more spread in \(x\) (\(S_{xx}\)). Both are deeply intuitive once you see why.
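You can watch both properties emerge in a quick simulation. Below is a minimal sketch in Python, assuming a made-up fixed design and true parameters (\(\beta_0 = 2\), \(\beta_1 = 4\), \(\sigma = 3\)): repeat the experiment many times, fit a slope each time, then look at the mean and variance of those slopes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design: the x values are constants, reused in every repetition.
x = np.array([1.0, 2.0, 4.0, 6.0, 9.0])
beta0, beta1, sigma = 2.0, 4.0, 3.0        # assumed "true" parameters
Sxx = np.sum((x - x.mean()) ** 2)

# 200,000 hypothetical experiments: fresh noise, fresh fitted slope each time.
n_reps = 200_000
eps = rng.normal(0.0, sigma, size=(n_reps, x.size))
y = beta0 + beta1 * x + eps                # each row is one experiment
slopes = ((x - x.mean()) * (y - y.mean(axis=1, keepdims=True))).sum(axis=1) / Sxx

print("mean of slopes:", slopes.mean())    # close to beta1 = 4 (unbiasedness)
print("var of slopes :", slopes.var())     # close to sigma^2 / Sxx
print("sigma^2 / Sxx :", sigma ** 2 / Sxx)
```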

Why It Works — The Intuition

First: Rewrite \(\hat{\beta}_1\) as a Linear Combination of \(y_i\)'s

This is the key insight that unlocks everything. The slope estimator looks like a complicated ratio, but it's secretly just a weighted sum of the \(y\) values:

\[ \hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{S_{xx}} \]

With a little algebra this simplifies: because \(\sum(x_i - \bar{x}) = 0\), the \(\bar{y}\) term in the numerator drops out, leaving \(\sum(x_i - \bar{x})(y_i - \bar{y}) = \sum(x_i - \bar{x})\,y_i\). Dividing each term by \(S_{xx}\) gives:

\[ \hat{\beta}_1 = \sum_{i=1}^n c_i \, y_i \quad \text{where} \quad c_i = \frac{x_i - \bar{x}}{S_{xx}} \]

This is huge. The slope is a linear function of the \(y_i\)'s. The weights \(c_i\) depend only on the \(x\) values, which we treat as fixed. Points with \(x\) values far from \(\bar{x}\) get more weight — they have more "leverage" on the slope. Points near \(\bar{x}\) barely influence it at all.

Let's verify the weights make sense:

\[ \sum c_i = \frac{\sum(x_i - \bar{x})}{S_{xx}} = \frac{0}{S_{xx}} = 0 \]

The weights sum to zero — positive for \(x_i > \bar{x}\), negative for \(x_i < \bar{x}\). This is exactly what a slope should do: it contrasts the \(y\) values on the right side of the data against the \(y\) values on the left.
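Here's a quick numerical check of these weight properties, using a small made-up dataset; any numbers would do, since the identities are algebraic:

```python
import numpy as np

# A small assumed dataset, just to check the algebra numerically.
x = np.array([1.0, 3.0, 4.0, 6.0, 8.0])
y = np.array([2.1, 7.9, 9.2, 14.1, 17.8])

Sxx = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / Sxx                      # the weights c_i

slope_ratio  = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
slope_linear = np.sum(c * y)                  # the same slope as a weighted sum

print(np.isclose(slope_ratio, slope_linear))  # True: the two forms agree
print(np.isclose(c.sum(), 0.0))               # True: weights sum to zero
print(np.isclose(np.sum(c * x), 1.0))         # True: needed for unbiasedness below
```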

Proving \(E(\hat{\beta}_1) = \beta_1\) (Unbiasedness)

Since \(\hat{\beta}_1 = \sum c_i y_i\) and each \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\):

\[ \hat{\beta}_1 = \sum c_i (\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_0 \sum c_i + \beta_1 \sum c_i x_i + \sum c_i \varepsilon_i \]

We already know \(\sum c_i = 0\), so the \(\beta_0\) term vanishes. For the middle term:

\[ \sum c_i x_i = \sum \frac{(x_i - \bar{x})}{S_{xx}} \cdot x_i = \frac{\sum (x_i - \bar{x}) x_i}{S_{xx}} = \frac{\sum (x_i - \bar{x})(x_i - \bar{x}) + \bar{x}\sum(x_i - \bar{x})}{S_{xx}} = \frac{S_{xx}}{S_{xx}} = 1 \]

So:

\[ \hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i \]

Taking the expected value, since \(E(\varepsilon_i) = 0\):

\[ \boxed{E(\hat{\beta}_1) = \beta_1} \]

The true slope, nothing more, nothing less. The random errors wash out in expectation because each \(\varepsilon_i\) has mean zero: the noise term \(\sum c_i \varepsilon_i\) is a fixed-weight sum of zero-mean errors, so its expectation is zero.

The beauty of this proof: \(\hat{\beta}_1\) equals the true slope plus a weighted sum of noise terms. On average, the noise disappears. In any single sample, it doesn't — which is why we need the variance.
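You can confirm the decomposition numerically. The sketch below, with assumed parameters and one synthetic sample, checks that the fitted slope equals \(\beta_1 + \sum c_i \varepsilon_i\) exactly (something only possible in simulation, where we know the \(\varepsilon_i\)):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.0, 2.0, 4.0, 6.0, 9.0])      # assumed fixed design
beta0, beta1, sigma = 2.0, 4.0, 3.0          # assumed true parameters

eps = rng.normal(0.0, sigma, size=x.size)    # the noise, known here only
y = beta0 + beta1 * x + eps                  # because we simulated it

Sxx = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / Sxx

slope = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
print(np.isclose(slope, beta1 + np.sum(c * eps)))   # True: slope = beta1 + weighted noise
```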

Deriving \(\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}\)

From \(\hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i\), the variance is:

\[ \text{Var}(\hat{\beta}_1) = \text{Var}\left(\sum c_i \varepsilon_i\right) \]

Since the \(\varepsilon_i\) are independent, each with variance \(\sigma^2\):

\[ \text{Var}(\hat{\beta}_1) = \sum c_i^2 \, \text{Var}(\varepsilon_i) = \sigma^2 \sum c_i^2 \]

Now compute \(\sum c_i^2\):

\[ \sum c_i^2 = \sum \left(\frac{x_i - \bar{x}}{S_{xx}}\right)^2 = \frac{\sum(x_i - \bar{x})^2}{S_{xx}^2} = \frac{S_{xx}}{S_{xx}^2} = \frac{1}{S_{xx}} \]

Therefore:

\[ \boxed{\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}} \]
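And a one-line numerical sanity check of the \(\sum c_i^2 = 1/S_{xx}\) identity (the x values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 6.0, 8.0])      # assumed x values
Sxx = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / Sxx

print(np.isclose(np.sum(c ** 2), 1.0 / Sxx))  # True: squared weights sum to 1/Sxx
```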

Why This Makes Sense

Look at what controls the variance:

\(\sigma^2\) in the numerator: More noise in the data means more uncertainty in the slope. If every measurement is contaminated by huge random errors, you can't pin down the slope precisely. This is unavoidable — you inherit the messiness of the world.

\(S_{xx}\) in the denominator: More spread in your \(x\) values means a more precise slope. This is deeply intuitive. Imagine estimating the slope of a hill:

  • If you measure elevation at two points 1 meter apart, a small measurement error could make the hill look flat or steep.
  • If you measure at two points 1 kilometer apart, the same measurement error barely changes the estimated slope.

Spread in \(x\) gives you a longer "baseline" against which to see the signal through the noise. This is why experimenters deliberately choose \(x\) values that are far apart — it's the single most effective way to reduce the variance of \(\hat{\beta}_1\).
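The hill analogy is easy to simulate. The sketch below compares two made-up designs with the same sample size and the same noise level, one bunched near \(\bar{x}\) and one spread out:

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma = 2.0, 4.0, 3.0          # assumed true parameters
n_reps = 100_000

def slope_sd(x):
    """Empirical SD of the fitted slope over many noise draws, x held fixed."""
    Sxx = np.sum((x - x.mean()) ** 2)
    eps = rng.normal(0.0, sigma, size=(n_reps, x.size))
    y = beta0 + beta1 * x + eps
    slopes = ((x - x.mean()) * (y - y.mean(axis=1, keepdims=True))).sum(axis=1) / Sxx
    return slopes.std()

x_narrow = np.array([4.8, 4.9, 5.0, 5.1, 5.2])   # bunched near the mean
x_wide   = np.array([1.0, 3.0, 5.0, 7.0, 9.0])   # spread out: bigger Sxx

print("SD of slope, narrow x:", slope_sd(x_narrow))   # large
print("SD of slope, wide x  :", slope_sd(x_wide))     # much smaller
```

With these numbers the formula predicts a slope standard deviation of \(3/\sqrt{0.1} \approx 9.49\) for the bunched design versus \(3/\sqrt{40} \approx 0.47\) for the spread-out one: same \(n\), same \(\sigma\), wildly different precision.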

\(n\) is hidden inside \(S_{xx}\): Since \(S_{xx} = \sum(x_i - \bar{x})^2\), adding more data points generally increases \(S_{xx}\). If the \(x\) values are evenly spaced with spacing \(d\), then \(S_{xx} \approx \frac{n^3 d^2}{12}\) — it grows roughly as \(n^3\), not \(n\). So the variance of the slope decreases faster than \(1/n\). More data helps, and spread-out data helps even more.
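A quick check of that growth rate (the spacing \(d = 0.5\) is arbitrary); the exact value is \(n(n^2 - 1)d^2/12\), so the approximation improves as \(n\) grows:

```python
import numpy as np

d = 0.5                                       # assumed spacing
for n in (10, 100, 1000):
    x = d * np.arange(1, n + 1)               # evenly spaced x values
    Sxx = np.sum((x - x.mean()) ** 2)
    print(n, Sxx, n**3 * d**2 / 12)           # exact Sxx vs. n^3 d^2 / 12
```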

Step-by-Step Summary

  1. Rewrite \(\hat{\beta}_1 = \sum c_i y_i\) with \(c_i = \frac{x_i - \bar{x}}{S_{xx}}\)
  2. Substitute \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)
  3. Use \(\sum c_i = 0\) and \(\sum c_i x_i = 1\) to get \(\hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i\)
  4. Take expectation: \(E(\hat{\beta}_1) = \beta_1\) (since \(E(\varepsilon_i) = 0\))
  5. Take variance: \(\text{Var}(\hat{\beta}_1) = \sigma^2 \sum c_i^2 = \sigma^2 / S_{xx}\)

Variables Explained

  • \(\hat{\beta}_1\), the OLS slope estimator: the slope computed from a particular sample.
  • \(\beta_1\), the true slope: the population parameter we're trying to estimate.
  • \(\sigma^2\), the error variance: the variance of the random errors \(\varepsilon_i\).
  • \(S_{xx}\), the sum of squared deviations in \(x\): \(\sum(x_i - \bar{x})^2\), which measures the spread of the \(x\) values.
  • \(c_i\), the weights: \(\frac{x_i - \bar{x}}{S_{xx}}\), how much each \(y_i\) influences the slope.

Worked Example

Using the study hours data from the OLS Estimators page:

\(S_{xx} = 26\), and suppose \(\sigma^2 = 16\) (known or estimated).

\[ \text{Var}(\hat{\beta}_1) = \frac{16}{26} \approx 0.615 \]
\[ \text{SE}(\hat{\beta}_1) = \sqrt{0.615} \approx 0.784 \]

Our slope was \(\hat{\beta}_1 \approx 4.04\). A 95% confidence interval (using \(t_{3, 0.025} \approx 3.18\) for \(n-2 = 3\) df):

\[ 4.04 \pm 3.18 \times 0.784 = 4.04 \pm 2.49 = (1.55, \; 6.53) \]

With 95% confidence, the true slope lies between 1.55 and 6.53 points per hour. With only 5 data points and \(S_{xx} = 26\), we can't pin it down very tightly. If we had 20 students with the same \(x\)-spread, \(S_{xx}\) would be much larger and the interval much narrower.
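Here is the same arithmetic in Python, with the example's numbers plugged in (scipy supplies the \(t\) critical value); tiny differences from the hand calculation are just rounding:

```python
import numpy as np
from scipy import stats

# Numbers taken from the worked example above.
Sxx, sigma2 = 26.0, 16.0
beta1_hat, n = 4.04, 5

var_slope = sigma2 / Sxx
se_slope = np.sqrt(var_slope)
t_crit = stats.t.ppf(0.975, df=n - 2)         # ~3.18 for 3 degrees of freedom

lo, hi = beta1_hat - t_crit * se_slope, beta1_hat + t_crit * se_slope
print(f"Var = {var_slope:.3f}, SE = {se_slope:.3f}")
print(f"95% CI: ({lo:.2f}, {hi:.2f})")        # ~(1.55, 6.53) up to rounding
```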

Common Mistakes

  • Forgetting that \(x\) values are treated as fixed: In the derivation, all the randomness comes from the \(\varepsilon_i\)'s. The \(x\) values are fixed constants (or we condition on them). This is why \(c_i\) are constants, not random variables.
  • Thinking more data always helps equally: The variance depends on \(S_{xx}\), not just \(n\). Adding 10 data points all near \(\bar{x}\) barely helps. Adding 2 data points at the extremes helps a lot. The spread of \(x\) matters more than the count.
  • Using \(\sigma^2\) when you have \(s^2\): In practice, \(\sigma^2\) is unknown and estimated by \(s^2 = \frac{\sum e_i^2}{n-2}\) (the MSE). When you plug in \(s^2\), you must use the \(t\)-distribution instead of the normal for inference (see the sketch after this list).
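To make that last point concrete, here's a sketch of the practical workflow on made-up data: fit the line, estimate \(\sigma^2\) with the MSE, and build a \(t\)-based interval.

```python
import numpy as np
from scipy import stats

# Assumed toy data; in practice sigma^2 is unknown and estimated from residuals.
x = np.array([1.0, 2.0, 4.0, 6.0, 9.0])
y = np.array([6.3, 9.8, 18.5, 25.7, 38.9])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - 2)             # MSE: the estimate of sigma^2
se_b1 = np.sqrt(s2 / Sxx)

t_crit = stats.t.ppf(0.975, df=n - 2)         # t, not z, because s^2 is estimated
print(f"b1 = {b1:.3f}, 95% CI: ({b1 - t_crit*se_b1:.3f}, {b1 + t_crit*se_b1:.3f})")
```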

History

  • 1821–1823 — Gauss proves the Gauss-Markov theorem: among all linear unbiased estimators, OLS has the smallest variance. This means \(\hat{\beta}_1\) isn't just unbiased — it's the best linear unbiased estimator (BLUE). You cannot do better without either accepting bias or going nonlinear.
  • 1908 — Gosset, publishing as "Student", derives the \(t\)-distribution, the correct distribution for \(\frac{\hat{\beta}_1 - \beta_1}{\text{SE}(\hat{\beta}_1)}\) when \(\sigma^2\) is estimated — enabling valid inference with small samples.

References

  • Gauss, C. F. (1821–1823). Theoria combinationis observationum erroribus minimis obnoxiae.
  • Weisberg, S. (2014). Applied Linear Regression, 4th ed.
  • Kutner, M. H., et al. (2005). Applied Linear Statistical Models, 5th ed.