Simple Linear Regression: Deriving the OLS Estimators¶
The Formulas¶

\[
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
What They Mean¶
You have a cloud of data points and you want to draw the "best" straight line through them: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\). But what does "best" mean?
\(\hat{\beta}_1\) (the slope) tells you: for every one-unit increase in \(x\), how much does \(y\) change on average? \(\hat{\beta}_0\) (the intercept) tells you: where does the line cross the y-axis — what's the predicted \(y\) when \(x = 0\)?
These are the Ordinary Least Squares (OLS) estimators — the values of \(\beta_0\) and \(\beta_1\) that make the sum of squared residuals as small as possible. They are, in a precise sense, the line that disagrees with the data the least.
Why It Works — The Story Behind the Formulas¶
The Least Squares Idea¶
The year is 1805. Adrien-Marie Legendre publishes a book on determining comet orbits and, almost as a footnote, introduces a method he calls moindres carrés — least squares. The idea is elegant: if you can't draw a line that passes through every point, draw the one that minimizes the total squared distance from each point to the line.
Why squared distances? Three reasons, all of them good:
- Signs cancel: Some points are above the line (positive error), some below (negative). If you just added the errors, they'd cancel out and you'd think a terrible line was great.
- Big errors matter more: Squaring penalizes large errors disproportionately. A point that's 10 units off contributes 100 to the sum, not 10. This pulls the line toward outliers — which is both a feature and a bug.
- The math works out beautifully: Squared errors lead to clean, closed-form solutions. Absolute errors don't — you'd need iterative methods.
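The first point is easy to see with a toy example. A minimal sketch (data and the "terrible line" are made up for illustration): for points lying exactly on \(y = 3x - 1\), the flat line \(y = 5\) has raw errors that cancel to zero, while the squared errors expose the misfit.

```python
# Why summed raw errors are a bad fit criterion:
# signed residuals can cancel even for a terrible line.
points = [(1, 2), (2, 5), (3, 8)]  # lies exactly on y = 3x - 1

residuals = [y - 5 for _, y in points]   # errors vs. the flat line y = 5
sum_raw = sum(residuals)                  # -3 + 0 + 3 = 0: looks "perfect"
sum_sq = sum(r * r for r in residuals)    # 9 + 0 + 9 = 18: reveals the misfit

print(sum_raw, sum_sq)  # 0 18
```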
Gauss later claimed he'd been using the method since 1795 (before Legendre published), leading to one of the pettiest priority disputes in mathematical history. Either way, the method stuck.
The Setup¶
We have the model:

\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n
\]

where \(\varepsilon_i\) are random errors. We want to find \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize:

\[
S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2
\]
This is just a function of two variables. Calculus tells us: take partial derivatives, set them to zero, solve.
Deriving \(\hat{\beta}_1\) (The Slope)¶
Step 1: Partial derivative with respect to \(\beta_0\)

\[
\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
\]

Divide by \(-2\) and expand:

\[
\sum y_i - n\beta_0 - \beta_1 \sum x_i = 0
\]

Solving for \(\beta_0\):

\[
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad (1)
\]
This is already our intercept formula! It says: the regression line always passes through the point \((\bar{x}, \bar{y})\) — the centroid of the data. That's a beautiful geometric fact.
Step 2: Partial derivative with respect to \(\beta_1\)

\[
\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0
\]

Divide by \(-2\):

\[
\sum x_i y_i - \beta_0 \sum x_i - \beta_1 \sum x_i^2 = 0 \qquad (2)
\]
Step 3: Substitute (1) into (2)

Replace \(\beta_0\) with \(\bar{y} - \hat{\beta}_1 \bar{x}\):

\[
\sum x_i y_i - (\bar{y} - \hat{\beta}_1 \bar{x}) \sum x_i - \hat{\beta}_1 \sum x_i^2 = 0
\]

Since \(\sum x_i = n\bar{x}\):

\[
\sum x_i y_i - n\bar{x}\bar{y} + \hat{\beta}_1 n\bar{x}^2 - \hat{\beta}_1 \sum x_i^2 = 0
\]

Rearranging:

\[
\hat{\beta}_1 \left( \sum x_i^2 - n\bar{x}^2 \right) = \sum x_i y_i - n\bar{x}\bar{y}
\]
Recognize that \(\sum x_i^2 - n\bar{x}^2 = \sum(x_i - \bar{x})^2 = S_{xx}\) and \(\sum x_i y_i - n\bar{x}\bar{y} = \sum(x_i - \bar{x})(y_i - \bar{y}) = S_{xy}\).
Therefore:

\[
\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}
\]
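The derivation ends in closed form, so fitting requires no iteration at all. A minimal pure-Python sketch of the two formulas (function name and data are illustrative):

```python
def ols_fit(x, y):
    """Closed-form OLS for simple linear regression:
    slope = S_xy / S_xx, intercept = ybar - slope * xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = ybar - slope * xbar
    return intercept, slope

# Points lying exactly on y = 2x + 1 recover the line perfectly.
b0, b1 = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```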
What This Formula Is Really Saying¶
Look at the slope formula again. The numerator \(S_{xy}\) measures how much \(x\) and \(y\) move together — it's the (unnormalized) covariance. The denominator \(S_{xx}\) measures how spread out the \(x\) values are.
The slope is literally: "how much do \(x\) and \(y\) co-vary, relative to how much \(x\) varies on its own?"
If \(y\) moves strongly with \(x\), the ratio is steep. If \(x\) varies a lot but \(y\) doesn't follow, the ratio is flat. If they move in opposite directions, the slope is negative. It's exactly the intuitive idea of "rise over run," but computed from noisy data.
Deriving \(\hat{\beta}_0\) (The Intercept)¶
We already found this in Step 1:

\[
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
Geometrically: start at the centroid \((\bar{x}, \bar{y})\) and slide down the line to \(x = 0\). The intercept is wherever you land.
This also means if you center your data (subtract the means), the intercept vanishes. Centered regression has \(\hat{\beta}_0 = 0\) and only the slope matters.
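A quick numerical sketch of that fact, using a small helper that implements the closed-form estimators (helper name and data are made up; the points lie exactly on \(y = 2x - 1\)):

```python
def fit(xs, ys):
    """Closed-form OLS: returns (intercept, slope)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

x = [2, 4, 6, 8]
y = [3, 7, 11, 15]                        # exactly y = 2x - 1

b0, b1 = fit(x, y)                        # (-1.0, 2.0)
xc = [a - sum(x) / len(x) for a in x]     # centered predictors
yc = [b - sum(y) / len(y) for b in y]     # centered responses
c0, c1 = fit(xc, yc)                      # intercept collapses to 0, slope unchanged
print(b0, b1, c0, c1)
```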
Variables Explained¶
| Symbol | Name | Description |
|---|---|---|
| \(\hat{\beta}_1\) | Estimated slope | Change in \(\hat{y}\) for a one-unit change in \(x\) |
| \(\hat{\beta}_0\) | Estimated intercept | Predicted \(y\) when \(x = 0\) |
| \(S_{xy}\) | Sum of cross-deviations | \(\sum(x_i - \bar{x})(y_i - \bar{y})\), measures co-movement |
| \(S_{xx}\) | Sum of squared deviations in \(x\) | \(\sum(x_i - \bar{x})^2\), measures spread in \(x\) |
| \(\bar{x}, \bar{y}\) | Sample means | The centroid of the data |
| \(\varepsilon_i\) | Error term | Random noise in each observation |
Worked Example¶
Study Hours vs. Exam Score¶
| Student | Hours (\(x\)) | Score (\(y\)) |
|---|---|---|
| 1 | 2 | 65 |
| 2 | 3 | 70 |
| 3 | 5 | 80 |
| 4 | 7 | 85 |
| 5 | 8 | 90 |
Compute the means:
\(\bar{x} = \frac{2+3+5+7+8}{5} = 5, \quad \bar{y} = \frac{65+70+80+85+90}{5} = 78\)
Compute \(S_{xy}\) and \(S_{xx}\):
| \(x_i - \bar{x}\) | \(y_i - \bar{y}\) | \((x_i-\bar{x})(y_i-\bar{y})\) | \((x_i-\bar{x})^2\) |
|---|---|---|---|
| \(-3\) | \(-13\) | \(39\) | \(9\) |
| \(-2\) | \(-8\) | \(16\) | \(4\) |
| \(0\) | \(2\) | \(0\) | \(0\) |
| \(2\) | \(7\) | \(14\) | \(4\) |
| \(3\) | \(12\) | \(36\) | \(9\) |
\(S_{xy} = 105, \quad S_{xx} = 26\)
The slope:

\[
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{105}{26} \approx 4.04
\]
Each additional hour of study predicts about 4 more points on the exam.
The intercept:

\[
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 78 - 4.04 \times 5 \approx 57.8
\]
The regression line: \(\hat{y} = 57.8 + 4.04x\).
A student who studies 0 hours would score about 58 (take this with a grain of salt — extrapolation beyond the data range is risky).
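The arithmetic above can be reproduced directly from the formulas. A sketch in plain Python:

```python
# Reproducing the worked example: study hours vs. exam score.
hours = [2, 3, 5, 7, 8]
scores = [65, 70, 80, 85, 90]

n = len(hours)
xbar = sum(hours) / n                  # 5.0
ybar = sum(scores) / n                 # 78.0

s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(hours, scores))  # 105.0
s_xx = sum((x - xbar) ** 2 for x in hours)                          # 26.0

slope = s_xy / s_xx                    # 105 / 26
intercept = ybar - slope * xbar

print(round(slope, 2), round(intercept, 2))  # 4.04 57.81
```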
Common Mistakes¶
- Interpreting the intercept literally: \(\hat{\beta}_0\) is where the line hits \(x = 0\), but if your data doesn't include \(x = 0\), this value is just extrapolation. A model predicting height from age might give \(\hat{\beta}_0 = 50\) cm, which makes sense for a newborn — but a model predicting salary from years of experience might give \(\hat{\beta}_0 = -\$20{,}000\), which is nonsense.
- Confusing correlation with causation: OLS finds the best linear fit. It says nothing about why \(x\) and \(y\) are related. Ice cream sales and drowning deaths are correlated (both rise in summer), but ice cream doesn't cause drowning.
- Forgetting that the line passes through \((\bar{x}, \bar{y})\): This is a useful sanity check. If your fitted line doesn't pass through the centroid, something went wrong.
Related Formulas¶
- SLR: Properties of the Slope Estimator — \(E(\hat{\beta}_1)\) and \(\text{Var}(\hat{\beta}_1)\)
- SLR: Properties of the Intercept Estimator — \(E(\hat{\beta}_0)\) and \(\text{Var}(\hat{\beta}_0)\)
- SLR: Mean Response and Prediction — confidence and prediction intervals
- Gaussian Distribution — the error distribution assumed in SLR
History¶
- 1805 — Adrien-Marie Legendre publishes the method of least squares in Nouvelles méthodes pour la détermination des orbites des comètes
- 1809 — Gauss claims he'd been using it since 1795 and publishes his own derivation in Theoria Motus, connecting least squares to the normal distribution
- 1821–1823 — Gauss proves the Gauss-Markov theorem: under certain conditions, OLS gives the best (minimum variance) linear unbiased estimators. This is why we use least squares, not just because the math is pretty
- 1886 — Francis Galton coins the term "regression" while studying how children's heights "regress toward the mean" relative to their parents — giving us the name we still use today
References¶
- Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes.
- Gauss, C. F. (1809). Theoria Motus Corporum Coelestium.
- Weisberg, S. (2014). Applied Linear Regression, 4th ed. Wiley.