SLR: Properties of the Slope Estimator¶
The Formulas¶

$$E(\hat{\beta}_1) = \beta_1$$

$$\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}} = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
What They Mean¶
The first equation says: \(\hat{\beta}_1\) is unbiased. If you could repeat your experiment infinitely many times — collecting new data each time and computing a new slope — the average of all those slopes would be exactly the true slope \(\beta_1\). Your estimator doesn't systematically overshoot or undershoot. On average, it's dead on.
The second equation tells you how much \(\hat{\beta}_1\) would bounce around across all those hypothetical repetitions. Every time you collect new data, you'd get a slightly different slope. The variance formula tells you how spread out those slopes would be.
Two things make \(\hat{\beta}_1\) more precise: smaller noise (\(\sigma^2\)) and more spread in \(x\) (\(S_{xx}\)). Both are deeply intuitive once you see why.
Why It Works — The Intuition¶
First: Rewrite \(\hat{\beta}_1\) as a Linear Combination of \(y_i\)'s¶
This is the key insight that unlocks everything. The slope estimator looks like a complicated ratio, but it's secretly just a weighted sum of the \(y\) values:

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

With some algebra (expanding the numerator and using the fact that \(\sum(x_i - \bar{x}) = 0\)), this simplifies to:

$$\hat{\beta}_1 = \sum_i c_i y_i, \qquad c_i = \frac{x_i - \bar{x}}{S_{xx}}$$
This is huge. The slope is a linear function of the \(y_i\)'s. The weights \(c_i\) depend only on the \(x\) values, which we treat as fixed. Points with \(x\) values far from \(\bar{x}\) get more weight — they have more "leverage" on the slope. Points near \(\bar{x}\) barely influence it at all.
Let's verify the weights make sense:

$$\sum_i c_i = \frac{\sum_i (x_i - \bar{x})}{S_{xx}} = \frac{0}{S_{xx}} = 0$$
The weights sum to zero — positive for \(x_i > \bar{x}\), negative for \(x_i < \bar{x}\). This is exactly what a slope should do: it contrasts the \(y\) values on the right side of the data against the \(y\) values on the left.
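The weight properties are easy to check numerically. A minimal sketch (the \(x\) values below are made up for illustration):

```python
import numpy as np

# illustrative x values; any fixed design works
x = np.array([1.0, 2.0, 4.0, 6.0, 9.0])
Sxx = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / Sxx          # the weights c_i

print(np.round(c, 4))             # negative left of x-bar, positive right of it
print(c.sum())                    # ~0 (up to floating point)
print((c * x).sum())              # ~1
```

Note the second identity, \(\sum c_i x_i = 1\), which the unbiasedness proof below relies on.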
Proving \(E(\hat{\beta}_1) = \beta_1\) (Unbiasedness)¶
Since \(\hat{\beta}_1 = \sum c_i y_i\) and each \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\):

$$\hat{\beta}_1 = \sum_i c_i (\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_0 \sum_i c_i + \beta_1 \sum_i c_i x_i + \sum_i c_i \varepsilon_i$$

We already know \(\sum c_i = 0\), so the \(\beta_0\) term vanishes. For the middle term:

$$\sum_i c_i x_i = \frac{\sum_i (x_i - \bar{x}) x_i}{S_{xx}} = \frac{\sum_i (x_i - \bar{x})^2}{S_{xx}} = 1$$

(The second equality holds because \(\bar{x}\sum_i(x_i - \bar{x}) = 0\), so subtracting \(\bar{x}\) from each \(x_i\) changes nothing.) So:

$$\hat{\beta}_1 = \beta_1 + \sum_i c_i \varepsilon_i$$

Taking the expected value, since \(E(\varepsilon_i) = 0\):

$$E(\hat{\beta}_1) = \beta_1 + \sum_i c_i E(\varepsilon_i) = \beta_1 + 0 = \beta_1$$
The true slope, nothing more, nothing less. The random errors wash out in expectation because the weights \(c_i\) are balanced — they sum to zero.
The beauty of this proof: \(\hat{\beta}_1\) equals the true slope plus a weighted sum of noise terms. On average, the noise disappears. In any single sample, it doesn't — which is why we need the variance.
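Unbiasedness can be seen directly in a small Monte Carlo sketch: repeat the "experiment" many times with fresh noise and average the slopes. All parameter values and \(x\) values below are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, sigma = 2.0, 4.0, 3.0        # true parameters (illustrative)
x = np.array([1.0, 2.0, 4.0, 6.0, 9.0])    # fixed design
Sxx = np.sum((x - x.mean()) ** 2)

n_reps = 50_000
slopes = np.empty(n_reps)
for r in range(n_reps):
    # new errors each repetition; x stays fixed
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / Sxx

print(slopes.mean())   # close to beta1 = 4.0
```

Any single `slopes[r]` misses the true slope, but their average homes in on it, exactly as the proof says.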
Deriving \(\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}\)¶
From \(\hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i\), the variance is:

$$\text{Var}(\hat{\beta}_1) = \text{Var}\!\left(\beta_1 + \sum_i c_i \varepsilon_i\right) = \text{Var}\!\left(\sum_i c_i \varepsilon_i\right)$$

Since the \(\varepsilon_i\) are independent, each with variance \(\sigma^2\):

$$\text{Var}(\hat{\beta}_1) = \sum_i c_i^2 \,\text{Var}(\varepsilon_i) = \sigma^2 \sum_i c_i^2$$

Now compute \(\sum c_i^2\):

$$\sum_i c_i^2 = \sum_i \frac{(x_i - \bar{x})^2}{S_{xx}^2} = \frac{S_{xx}}{S_{xx}^2} = \frac{1}{S_{xx}}$$

Therefore:

$$\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}$$
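Both the algebraic identity \(\sum c_i^2 = 1/S_{xx}\) and the final variance formula can be checked numerically. A sketch with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, sigma = 2.0, 4.0, 3.0        # illustrative true parameters
x = np.array([1.0, 2.0, 4.0, 6.0, 9.0])    # fixed design
Sxx = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / Sxx

# algebraic identity: sum of squared weights is 1/Sxx
print(np.sum(c ** 2), 1 / Sxx)

# Monte Carlo check of Var(b1) = sigma^2 / Sxx, using b1 = sum(c_i * y_i)
slopes = np.array([
    np.sum(c * (beta0 + beta1 * x + rng.normal(0.0, sigma, x.size)))
    for _ in range(50_000)
])
print(slopes.var(), sigma ** 2 / Sxx)      # the two should be close
```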
Why This Makes Sense¶
Look at what controls the variance:
\(\sigma^2\) in the numerator: More noise in the data means more uncertainty in the slope. If every measurement is contaminated by huge random errors, you can't pin down the slope precisely. This is unavoidable — you inherit the messiness of the world.
\(S_{xx}\) in the denominator: More spread in your \(x\) values means a more precise slope. Imagine estimating the slope of a hill:
- If you measure elevation at two points 1 meter apart, a small measurement error could make the hill look flat or steep.
- If you measure at two points 1 kilometer apart, the same measurement error barely changes the estimated slope.
Spread in \(x\) gives you a longer "baseline" against which to see the signal through the noise. This is why experimenters deliberately choose \(x\) values that are far apart — it's the single most effective way to reduce the variance of \(\hat{\beta}_1\).
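The design effect is easy to see by comparing two hypothetical designs with the same sample size and the same error variance (the numbers below are made up):

```python
import numpy as np

sigma2 = 9.0   # assumed error variance, same for both designs
designs = {
    "clustered":  np.array([4.0, 4.5, 5.0, 5.5, 6.0]),
    "spread out": np.array([0.0, 2.5, 5.0, 7.5, 10.0]),
}
for name, x in designs.items():
    Sxx = np.sum((x - x.mean()) ** 2)
    # same n, same noise -- only the spread of x differs
    print(f"{name}: Sxx = {Sxx:.1f}, Var(b1) = {sigma2 / Sxx:.3f}")
```

Same five observations, same noise; the spread-out design pins down the slope far more tightly.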
\(n\) is hidden inside \(S_{xx}\): Since \(S_{xx} = \sum(x_i - \bar{x})^2\), adding more data points generally increases \(S_{xx}\). If the \(x\) values are evenly spaced with spacing \(d\), then \(S_{xx} \approx \frac{n^3 d^2}{12}\) — it grows roughly as \(n^3\), not \(n\). So the variance of the slope decreases faster than \(1/n\). More data helps, and spread-out data helps even more.
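The cubic growth of \(S_{xx}\) for an evenly spaced grid can be verified directly (the exact value is \(n(n^2-1)d^2/12\), which the approximation \(n^3 d^2/12\) tracks closely):

```python
import numpy as np

d = 1.0   # spacing between consecutive x values
for n in [5, 10, 20, 40]:
    x = d * np.arange(n)
    Sxx = np.sum((x - x.mean()) ** 2)
    # compare to the n^3 d^2 / 12 approximation
    print(n, Sxx, n ** 3 * d ** 2 / 12)
```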
Step-by-Step Summary¶
- Rewrite \(\hat{\beta}_1 = \sum c_i y_i\) with \(c_i = \frac{x_i - \bar{x}}{S_{xx}}\)
- Substitute \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)
- Use \(\sum c_i = 0\) and \(\sum c_i x_i = 1\) to get \(\hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i\)
- Take expectation: \(E(\hat{\beta}_1) = \beta_1\) (since \(E(\varepsilon_i) = 0\))
- Take variance: \(\text{Var}(\hat{\beta}_1) = \sigma^2 \sum c_i^2 = \sigma^2 / S_{xx}\)
Variables Explained¶
| Symbol | Name | Description |
|---|---|---|
| \(\hat{\beta}_1\) | OLS slope estimator | The slope computed from a particular sample |
| \(\beta_1\) | True slope | The population parameter we're trying to estimate |
| \(\sigma^2\) | Error variance | Variance of the random errors \(\varepsilon_i\) |
| \(S_{xx}\) | Sum of squared deviations in \(x\) | \(\sum(x_i - \bar{x})^2\) — measures spread in \(x\) |
| \(c_i\) | Weights | \(\frac{x_i - \bar{x}}{S_{xx}}\) — how much each \(y_i\) influences the slope |
Worked Example¶
Using the study hours data from the OLS Estimators page:
\(S_{xx} = 26\), and suppose the error variance is estimated as \(s^2 = 16\) (the MSE). Our slope was \(\hat{\beta}_1 \approx 4.04\). Because \(\sigma^2\) is estimated, we use the \(t\)-distribution: a 95% confidence interval (using \(t_{3, 0.025} \approx 3.18\) for \(n-2 = 3\) df) is:

$$\hat{\beta}_1 \pm t_{3,\,0.025}\sqrt{\frac{s^2}{S_{xx}}} = 4.04 \pm 3.18\sqrt{\frac{16}{26}} = 4.04 \pm 2.49 \;\Rightarrow\; (1.55,\; 6.53)$$
The true slope is probably between 1.55 and 6.53 points per hour. With only 5 data points and \(S_{xx} = 26\), we can't pin it down very tightly. If we had 20 students with the same \(x\)-spread, \(S_{xx}\) would be much larger and the interval much narrower.
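The arithmetic of the worked example, as a quick check:

```python
import math

# numbers from the worked example above
b1, s2, Sxx = 4.04, 16.0, 26.0
t_crit = 3.18                      # t_{3, 0.025} for n - 2 = 3 df

se = math.sqrt(s2 / Sxx)           # standard error of the slope
half = t_crit * se                 # half-width of the interval
print(f"({b1 - half:.2f}, {b1 + half:.2f})")   # (1.55, 6.53)
```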
Common Mistakes¶
- Forgetting that \(x\) values are treated as fixed: In the derivation, all the randomness comes from the \(\varepsilon_i\)'s. The \(x\) values are fixed constants (or we condition on them). This is why \(c_i\) are constants, not random variables.
- Thinking more data always helps equally: The variance depends on \(S_{xx}\), not just \(n\). Adding 10 data points all near \(\bar{x}\) barely helps. Adding 2 data points at the extremes helps a lot. The spread of \(x\) matters more than the count.
- Using \(\sigma^2\) when you have \(s^2\): In practice, \(\sigma^2\) is unknown and estimated by \(s^2 = \frac{\sum e_i^2}{n-2}\) (the MSE). When you plug in \(s^2\), you must use the \(t\)-distribution instead of the normal for inference.
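The last point in practice: estimate \(s^2\) from the residuals with the \(n-2\) divisor, then plug it into the variance formula. A sketch on a hypothetical small data set (not the study-hours data):

```python
import numpy as np

# hypothetical data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)                  # residuals
s2 = np.sum(e ** 2) / (len(x) - 2)     # MSE: divide by n - 2, not n
se_b1 = np.sqrt(s2 / Sxx)              # estimated standard error of b1
print(s2, se_b1)
```

With `s2` in place of \(\sigma^2\), inference on `b1` uses the \(t\)-distribution with \(n-2\) degrees of freedom.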
Related Formulas¶
- SLR: Deriving the OLS Estimators — where \(\hat{\beta}_1\) comes from
- SLR: Properties of the Intercept Estimator — the companion derivation for \(\hat{\beta}_0\)
- SLR: Mean Response and Prediction — using the slope to build confidence bands
- Standard Error — the general idea of estimator variability
History¶
- 1821–1823 — Gauss proves the Gauss-Markov theorem: among all linear unbiased estimators, OLS has the smallest variance. This means \(\hat{\beta}_1\) isn't just unbiased — it's the best linear unbiased estimator (BLUE). You literally cannot do better without either accepting bias or going nonlinear.
- 1908 — Gosset's t-distribution provides the correct distribution for \(\frac{\hat{\beta}_1 - \beta_1}{\text{SE}(\hat{\beta}_1)}\) when \(\sigma^2\) is estimated — enabling valid inference with small samples.
References¶
- Gauss, C. F. (1821–1823). Theoria combinationis observationum erroribus minimis obnoxiae.
- Weisberg, S. (2014). Applied Linear Regression, 4th ed.
- Kutner, M. H., et al. (2004). Applied Linear Statistical Models, 5th ed.