SLR: Properties of the Slope Estimator¶
The Formulas¶

$$E(\hat{\beta}_1) = \beta_1$$

$$\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}} = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
What They Mean¶
The first equation says: \(\hat{\beta}_1\) is unbiased. If you could repeat your experiment infinitely many times — collecting new data each time and computing a new slope — the average of all those slopes would be exactly the true slope \(\beta_1\). Your estimator doesn't systematically overshoot or undershoot. On average, it's dead on.
The second equation tells you how much \(\hat{\beta}_1\) would bounce around across all those hypothetical repetitions. Every time you collect new data, you'd get a slightly different slope. The variance formula tells you how spread out those slopes would be.
Two things make \(\hat{\beta}_1\) more precise: smaller noise (\(\sigma^2\)) and more spread in \(x\) (\(S_{xx}\)). Both are deeply intuitive once you see why.
Why It Works — The Intuition¶
First: Rewrite \(\hat{\beta}_1\) as a Linear Combination of \(y_i\)'s¶
This is the key insight that unlocks everything. The slope estimator looks like a complicated ratio, but it's secretly just a weighted sum of the \(y\) values:

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

With some algebra (expanding the numerator and using the fact that \(\sum(x_i - \bar{x}) = 0\)), this simplifies to:

$$\hat{\beta}_1 = \sum_i c_i y_i, \qquad c_i = \frac{x_i - \bar{x}}{S_{xx}}$$
This is huge. The slope is a linear function of the \(y_i\)'s. The weights \(c_i\) depend only on the \(x\) values, which we treat as fixed. Points with \(x\) values far from \(\bar{x}\) get more weight — they have more "leverage" on the slope. Points near \(\bar{x}\) barely influence it at all.
Let's verify the weights make sense:

$$\sum_i c_i = \frac{\sum_i (x_i - \bar{x})}{S_{xx}} = \frac{0}{S_{xx}} = 0$$
The weights sum to zero — positive for \(x_i > \bar{x}\), negative for \(x_i < \bar{x}\). This is exactly what a slope should do: it contrasts the \(y\) values on the right side of the data against the \(y\) values on the left.
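The weight properties are easy to check numerically. A minimal sketch (the \(x\) values below are made up for illustration):

```python
import numpy as np

# illustrative x values; any fixed design works
x = np.array([1.0, 2.0, 4.0, 6.0, 9.0])
Sxx = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / Sxx          # the weights c_i

print(np.round(c, 4))             # negative left of x-bar, positive right of it
print(c.sum())                    # ~0 (up to floating point)
print((c * x).sum())              # ~1
```

Note the second identity, \(\sum c_i x_i = 1\), which the unbiasedness proof below relies on.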
Proving \(E(\hat{\beta}_1) = \beta_1\) (Unbiasedness)¶
Since \(\hat{\beta}_1 = \sum c_i y_i\) and each \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\):

$$\hat{\beta}_1 = \sum_i c_i (\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_0 \sum_i c_i + \beta_1 \sum_i c_i x_i + \sum_i c_i \varepsilon_i$$

We already know \(\sum c_i = 0\), so the \(\beta_0\) term vanishes. For the middle term:

$$\sum_i c_i x_i = \frac{\sum_i (x_i - \bar{x}) x_i}{S_{xx}} = \frac{\sum_i (x_i - \bar{x})^2}{S_{xx}} = 1$$

(The second equality holds because \(\bar{x}\sum_i(x_i - \bar{x}) = 0\), so subtracting \(\bar{x}\) from each \(x_i\) changes nothing.) So:

$$\hat{\beta}_1 = \beta_1 + \sum_i c_i \varepsilon_i$$

Taking the expected value, since \(E(\varepsilon_i) = 0\):

$$E(\hat{\beta}_1) = \beta_1 + \sum_i c_i E(\varepsilon_i) = \beta_1 + 0 = \beta_1$$
The true slope, nothing more, nothing less. The random errors wash out in expectation because the weights \(c_i\) are balanced — they sum to zero.
The beauty of this proof: \(\hat{\beta}_1\) equals the true slope plus a weighted sum of noise terms. On average, the noise disappears. In any single sample, it doesn't — which is why we need the variance.
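Unbiasedness can be seen directly in a small Monte Carlo sketch: repeat the "experiment" many times with fresh noise and average the slopes. All parameter values and \(x\) values below are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, sigma = 2.0, 4.0, 3.0        # true parameters (illustrative)
x = np.array([1.0, 2.0, 4.0, 6.0, 9.0])    # fixed design
Sxx = np.sum((x - x.mean()) ** 2)

n_reps = 50_000
slopes = np.empty(n_reps)
for r in range(n_reps):
    # new errors each repetition; x stays fixed
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / Sxx

print(slopes.mean())   # close to beta1 = 4.0
```

Any single `slopes[r]` misses the true slope, but their average homes in on it, exactly as the proof says.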
Deriving \(\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}\)¶
From \(\hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i\), the variance is:

$$\text{Var}(\hat{\beta}_1) = \text{Var}\!\left(\beta_1 + \sum_i c_i \varepsilon_i\right) = \text{Var}\!\left(\sum_i c_i \varepsilon_i\right)$$

Since the \(\varepsilon_i\) are independent, each with variance \(\sigma^2\):

$$\text{Var}(\hat{\beta}_1) = \sum_i c_i^2 \,\text{Var}(\varepsilon_i) = \sigma^2 \sum_i c_i^2$$

Now compute \(\sum c_i^2\):

$$\sum_i c_i^2 = \sum_i \frac{(x_i - \bar{x})^2}{S_{xx}^2} = \frac{S_{xx}}{S_{xx}^2} = \frac{1}{S_{xx}}$$

Therefore:

$$\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}$$
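Both the algebraic identity \(\sum c_i^2 = 1/S_{xx}\) and the final variance formula can be checked numerically. A sketch with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, sigma = 2.0, 4.0, 3.0        # illustrative true parameters
x = np.array([1.0, 2.0, 4.0, 6.0, 9.0])    # fixed design
Sxx = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / Sxx

# algebraic identity: sum of squared weights is 1/Sxx
print(np.sum(c ** 2), 1 / Sxx)

# Monte Carlo check of Var(b1) = sigma^2 / Sxx, using b1 = sum(c_i * y_i)
slopes = np.array([
    np.sum(c * (beta0 + beta1 * x + rng.normal(0.0, sigma, x.size)))
    for _ in range(50_000)
])
print(slopes.var(), sigma ** 2 / Sxx)      # the two should be close
```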
Why This Makes Sense¶
Look at what controls the variance:
\(\sigma^2\) in the numerator: More noise in the data means more uncertainty in the slope. If every measurement is contaminated by huge random errors, you can't pin down the slope precisely. This is unavoidable — you inherit the messiness of the world.
\(S_{xx}\) in the denominator: More spread in your \(x\) values means a more precise slope. Imagine estimating the slope of a hill:
- If you measure elevation at two points 1 meter apart, a small measurement error could make the hill look flat or steep.
- If you measure at two points 1 kilometer apart, the same measurement error barely changes the estimated slope.
Spread in \(x\) gives you a longer "baseline" against which to see the signal through the noise. This is why experimenters deliberately choose \(x\) values that are far apart — it's the single most effective way to reduce the variance of \(\hat{\beta}_1\).
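The design effect is easy to see by comparing two hypothetical designs with the same sample size and the same error variance (the numbers below are made up):

```python
import numpy as np

sigma2 = 9.0   # assumed error variance, same for both designs
designs = {
    "clustered":  np.array([4.0, 4.5, 5.0, 5.5, 6.0]),
    "spread out": np.array([0.0, 2.5, 5.0, 7.5, 10.0]),
}
for name, x in designs.items():
    Sxx = np.sum((x - x.mean()) ** 2)
    # same n, same noise -- only the spread of x differs
    print(f"{name}: Sxx = {Sxx:.1f}, Var(b1) = {sigma2 / Sxx:.3f}")
```

Same five observations, same noise; the spread-out design pins down the slope far more tightly.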
\(n\) is hidden inside \(S_{xx}\): Since \(S_{xx} = \sum(x_i - \bar{x})^2\), adding more data points generally increases \(S_{xx}\). If the \(x\) values are evenly spaced with spacing \(d\), then \(S_{xx} \approx \frac{n^3 d^2}{12}\) — it grows roughly as \(n^3\), not \(n\). So the variance of the slope decreases faster than \(1/n\). More data helps, and spread-out data helps even more.
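The cubic growth of \(S_{xx}\) for an evenly spaced grid can be verified directly (the exact value is \(n(n^2-1)d^2/12\), which the approximation \(n^3 d^2/12\) tracks closely):

```python
import numpy as np

d = 1.0   # spacing between consecutive x values
for n in [5, 10, 20, 40]:
    x = d * np.arange(n)
    Sxx = np.sum((x - x.mean()) ** 2)
    # compare to the n^3 d^2 / 12 approximation
    print(n, Sxx, n ** 3 * d ** 2 / 12)
```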
Step-by-Step Summary¶
- Rewrite \(\hat{\beta}_1 = \sum c_i y_i\) with \(c_i = \frac{x_i - \bar{x}}{S_{xx}}\)
- Substitute \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)
- Use \(\sum c_i = 0\) and \(\sum c_i x_i = 1\) to get \(\hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i\)
- Take expectation: \(E(\hat{\beta}_1) = \beta_1\) (since \(E(\varepsilon_i) = 0\))
- Take variance: \(\text{Var}(\hat{\beta}_1) = \sigma^2 \sum c_i^2 = \sigma^2 / S_{xx}\)
Variables Explained¶
| Symbol | Name | Description |
|---|---|---|
| \(\hat{\beta}_1\) | OLS slope estimator | The slope computed from a particular sample |
| \(\beta_1\) | True slope | The population parameter we're trying to estimate |
| \(\sigma^2\) | Error variance | Variance of the random errors \(\varepsilon_i\) |
| \(S_{xx}\) | Sum of squared deviations in \(x\) | \(\sum(x_i - \bar{x})^2\) — measures spread in \(x\) |
| \(c_i\) | Weights | \(\frac{x_i - \bar{x}}{S_{xx}}\) — how much each \(y_i\) influences the slope |
Worked Example¶
Using the study hours data from the OLS Estimators page:
\(S_{xx} = 26\), and suppose the error variance is estimated as \(s^2 = 16\) (the MSE). Our slope was \(\hat{\beta}_1 \approx 4.04\). Because \(\sigma^2\) is estimated, we use the \(t\)-distribution: a 95% confidence interval (using \(t_{3, 0.025} \approx 3.18\) for \(n-2 = 3\) df) is:

$$\hat{\beta}_1 \pm t_{3,\,0.025}\sqrt{\frac{s^2}{S_{xx}}} = 4.04 \pm 3.18\sqrt{\frac{16}{26}} = 4.04 \pm 2.49 \;\Rightarrow\; (1.55,\; 6.53)$$
The true slope is probably between 1.55 and 6.53 points per hour. With only 5 data points and \(S_{xx} = 26\), we can't pin it down very tightly. If we had 20 students with the same \(x\)-spread, \(S_{xx}\) would be much larger and the interval much narrower.
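The arithmetic of the worked example, as a quick check:

```python
import math

# numbers from the worked example above
b1, s2, Sxx = 4.04, 16.0, 26.0
t_crit = 3.18                      # t_{3, 0.025} for n - 2 = 3 df

se = math.sqrt(s2 / Sxx)           # standard error of the slope
half = t_crit * se                 # half-width of the interval
print(f"({b1 - half:.2f}, {b1 + half:.2f})")   # (1.55, 6.53)
```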
Common Mistakes¶
- Forgetting that \(x\) values are treated as fixed: In the derivation, all the randomness comes from the \(\varepsilon_i\)'s. The \(x\) values are fixed constants (or we condition on them). This is why \(c_i\) are constants, not random variables.
- Thinking more data always helps equally: The variance depends on \(S_{xx}\), not just \(n\). Adding 10 data points all near \(\bar{x}\) barely helps. Adding 2 data points at the extremes helps a lot. The spread of \(x\) matters more than the count.
- Using \(\sigma^2\) when you have \(s^2\): In practice, \(\sigma^2\) is unknown and estimated by \(s^2 = \frac{\sum e_i^2}{n-2}\) (the MSE). When you plug in \(s^2\), you must use the \(t\)-distribution instead of the normal for inference.
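The last point in practice: estimate \(s^2\) from the residuals with the \(n-2\) divisor, then plug it into the variance formula. A sketch on a hypothetical small data set (not the study-hours data):

```python
import numpy as np

# hypothetical data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)                  # residuals
s2 = np.sum(e ** 2) / (len(x) - 2)     # MSE: divide by n - 2, not n
se_b1 = np.sqrt(s2 / Sxx)              # estimated standard error of b1
print(s2, se_b1)
```

With `s2` in place of \(\sigma^2\), inference on `b1` uses the \(t\)-distribution with \(n-2\) degrees of freedom.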
Related Formulas¶
- SLR: Deriving the OLS Estimators — where \(\hat{\beta}_1\) comes from
- SLR: Properties of the Intercept Estimator — the companion derivation for \(\hat{\beta}_0\)
- SLR: Mean Response and Prediction — using the slope to build confidence bands
- Standard Error — the general idea of estimator variability
History¶
- 1821–1823 — Gauss proves the Gauss-Markov theorem: among all linear unbiased estimators, OLS has the smallest variance. This means \(\hat{\beta}_1\) isn't just unbiased — it's the best linear unbiased estimator (BLUE). You literally cannot do better without either accepting bias or going nonlinear.
- 1908 — Gosset's t-distribution provides the correct distribution for \(\frac{\hat{\beta}_1 - \beta_1}{\text{SE}(\hat{\beta}_1)}\) when \(\sigma^2\) is estimated — enabling valid inference with small samples.
References¶
- Gauss, C. F. (1821–1823). Theoria combinationis observationum erroribus minimis obnoxiae.
- Weisberg, S. (2014). Applied Linear Regression, 4th ed.
- Kutner, M. H., et al. (2004). Applied Linear Statistical Models, 5th ed.