
Cook's Distance

The Formula

\[ D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{p \cdot MSE} \]

Where \(\hat{y}_{j(i)}\) is the predicted value for observation \(j\) when observation \(i\) has been removed from the dataset.

An equivalent and more practical formula is:

\[ D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2} \]

What It Means

Cook's Distance measures how much all the fitted values in a regression change when a single observation is deleted. It combines two things:

  1. How large the residual is (how badly the model fits that point)
  2. How much leverage the point has (how far it is from the center of the data)

A point with a large Cook's Distance is called an influential point — removing it would noticeably change the regression results. Think of it as asking: "If I erased this one data point, would my entire regression line shift?"

Why It Works — The Intuition

Imagine fitting a regression line through your data. Now imagine removing one point and refitting. If the line barely moves, that point wasn't very influential. If the line shifts dramatically, that point was pulling the line toward itself.

Cook's Distance formalizes this idea. It measures the "distance" between the full set of predictions \(\hat{y}\) and the predictions \(\hat{y}_{(i)}\) you'd get without observation \(i\).

The key insight is that a point can be influential in two ways:

  • It has a large residual — the model already fits it poorly, suggesting it's unusual in the \(y\)-direction.
  • It has high leverage — it's far from the center of the predictor space (unusual in the \(x\)-direction), so it "pulls" the regression line.

Cook's Distance captures both effects simultaneously, which is why the equivalent formula has a residual term and a leverage term.
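The leave-one-out definition can be checked directly by brute force: refit the model once per deleted observation and measure how far all the predictions move. A minimal NumPy sketch, recomputing everything from the small dataset used in the worked example later in this article:

```python
import numpy as np

# Small dataset with one far-out x value (same data as the worked example).
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.0, 4.0, 5.0, 30.0])
n, p = len(x), 2                      # p counts intercept + slope
X = np.column_stack([np.ones(n), x])  # design matrix with intercept

beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta
mse = np.sum((y - yhat) ** 2) / (n - p)

# Definitional Cook's distance: refit n times, once per deleted observation.
D = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    yhat_i = X @ beta_i               # predict ALL n points from the reduced fit
    D[i] = np.sum((yhat - yhat_i) ** 2) / (p * mse)

print(np.round(D, 3))                 # the point at x = 10 dominates
```

This O(n) refitting is fine for illustration; the derivation below shows why it is never needed in practice.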

Derivation

Starting Point

We want to measure how much the fitted values change when observation \(i\) is removed. The natural measure is the sum of squared differences between the two sets of predictions:

\[ D_i = \frac{(\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)})^T (\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)})}{p \cdot MSE} \]

Expressing in Terms of Coefficients

The fitted values are \(\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{\beta}}\), and the leave-one-out fitted values are \(\hat{\mathbf{y}}_{(i)} = \mathbf{X}\hat{\mathbf{\beta}}_{(i)}\). Therefore:

\[ D_i = \frac{(\hat{\mathbf{\beta}} - \hat{\mathbf{\beta}}_{(i)})^T (\mathbf{X}^T\mathbf{X}) (\hat{\mathbf{\beta}} - \hat{\mathbf{\beta}}_{(i)})}{p \cdot MSE} \]
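This equivalence is easy to confirm numerically: the squared distance between the two prediction vectors equals the quadratic form in the coefficient difference, since \(\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)} = \mathbf{X}(\hat{\mathbf{\beta}} - \hat{\mathbf{\beta}}_{(i)})\). A sketch on toy data, for a single deleted observation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.0, 4.0, 5.0, 30.0])
n, p = len(x), 2
X = np.column_stack([np.ones(n), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
mse = np.sum((y - X @ beta) ** 2) / (n - p)

i = 3                                  # delete the last observation
keep = np.arange(n) != i
beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]

# Distance between the two prediction vectors ...
diff = X @ beta - X @ beta_i
D_pred = diff @ diff / (p * mse)

# ... equals the quadratic form in the coefficient difference,
# because yhat - yhat_(i) = X (beta - beta_(i)).
d = beta - beta_i
D_coef = d @ (X.T @ X) @ d / (p * mse)

assert np.isclose(D_pred, D_coef)
```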

Using the Sherman-Morrison-Woodbury Identity

Rather than refitting the model \(n\) times, we can use a matrix identity to show that removing observation \(i\) changes the coefficients by:

\[ \hat{\mathbf{\beta}} - \hat{\mathbf{\beta}}_{(i)} = \frac{(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_i e_i}{1 - h_{ii}} \]

where \(e_i = y_i - \hat{y}_i\) is the ordinary residual and \(h_{ii}\) is the \(i\)-th diagonal element of the hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\).
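The identity is what spares us the \(n\) refits, and it can be sanity-checked against an actual refit. A sketch on the same toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.0, 4.0, 5.0, 30.0])
n = len(x)
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T                  # hat matrix
e = y - X @ beta                       # ordinary residuals

i = 3                                  # the high-leverage point at x = 10
h_ii = H[i, i]

# Closed-form coefficient change from the identity above
delta_formula = XtX_inv @ X[i] * e[i] / (1 - h_ii)

# Brute force: actually delete observation i and refit
keep = np.arange(n) != i
delta_brute = beta - np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]

assert np.allclose(delta_formula, delta_brute)
```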

Substituting Back

Plugging this into the distance formula and simplifying:

\[ D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2} \]

This is the practical formula. It decomposes Cook's Distance into:

  • A residual component: \(\frac{e_i^2}{p \cdot MSE}\) — how poorly the model fits this point
  • A leverage component: \(\frac{h_{ii}}{(1 - h_{ii})^2}\) — how much this point can affect the fit
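In code, the practical formula needs only a single fit — both components come from the residuals and the diagonal of the hat matrix. A NumPy sketch, again on the toy data from the worked example:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.0, 4.0, 5.0, 30.0])
n, p = len(x), 2
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                           # leverages
e = y - H @ y                            # residuals (H @ y are the fitted values)
mse = np.sum(e ** 2) / (n - p)

residual_part = e ** 2 / (p * mse)       # how poorly each point is fit
leverage_part = h / (1 - h) ** 2         # how much each point can move the fit
D = residual_part * leverage_part        # elementwise product of the two parts
```

Note how `leverage_part` blows up as \(h_{ii} \to 1\): a point with leverage near 1 can be hugely influential even with a tiny residual.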

The Hat Matrix and Leverage

The hat matrix is defined as:

\[ \mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \]

It maps observed values to fitted values: \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\). The diagonal element \(h_{ii}\) is the leverage of observation \(i\):

  • \(0 \leq h_{ii} \leq 1\)
  • Average leverage is \(\frac{p}{n}\)
  • High leverage means the point is far from the center of the predictor space

Thresholds for "Influential"

Common rules of thumb:

  • \(D_i > 1\): Almost always influential (widely used cutoff).
  • \(D_i > \frac{4}{n}\): A more sensitive threshold that flags more points for review.
  • \(D_i > \frac{4}{n - p - 1}\): Adjusts for model complexity.

These are guidelines, not hard rules. Always investigate flagged points rather than automatically removing them.
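The point that different cutoffs disagree is easy to see in code. A sketch with a hypothetical helper (`flag_influential` is illustrative, not a standard library function) applied to the first few Cook's distances from a hypothetical fit with \(n = 25\), \(p = 3\):

```python
import numpy as np

def flag_influential(D, n, p):
    """Illustrative helper: indices flagged by each rule of thumb."""
    return {
        "D > 1": np.where(D > 1)[0].tolist(),
        "D > 4/n": np.where(D > 4 / n)[0].tolist(),
        "D > 4/(n-p-1)": np.where(D > 4 / (n - p - 1))[0].tolist(),
    }

# Hypothetical Cook's distances (first five observations of a larger fit)
D = np.array([0.02, 0.10, 0.18, 1.40, 0.05])
flags = flag_influential(D, n=25, p=3)
# The cutoffs disagree: 4/n = 0.16 also flags index 2 (D = 0.18),
# while D > 1 and D > 4/(n-p-1) = 0.19 flag only index 3.
```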

Variables Explained

| Symbol | Name | Description |
|---|---|---|
| \(D_i\) | Cook's Distance | Influence measure for observation \(i\) |
| \(e_i\) | Residual | \(y_i - \hat{y}_i\), the error for observation \(i\) |
| \(h_{ii}\) | Leverage | Diagonal of the hat matrix for observation \(i\) |
| \(p\) | Number of parameters | Including the intercept (e.g., 2 in simple linear regression) |
| \(n\) | Sample size | Number of observations |
| \(MSE\) | Mean squared error | \(\frac{1}{n-p} \sum e_i^2\) |
| \(\hat{y}_{j(i)}\) | Leave-one-out prediction | Prediction for \(j\) without observation \(i\) |
| \(\mathbf{H}\) | Hat matrix | \(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) |

Worked Example

Data: Simple linear regression with \(n = 4\) observations.

| \(x\) | \(y\) |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 5 |
| 10 | 30 |

Step 1: Fit the model. The OLS fit gives \(\hat{y} = -2.63 + 3.22x\).

  • \(MSE = 3.165\), \(p = 2\)

Step 2: Compute residuals and leverages.

| \(i\) | \(\hat{y}_i\) | \(e_i\) | \(h_{ii}\) |
|---|---|---|---|
| 1 | 0.59 | 1.41 | 0.43 |
| 2 | 3.81 | 0.19 | 0.33 |
| 3 | 7.03 | -2.03 | 0.27 |
| 4 | 29.57 | 0.43 | 0.97 |

(Note: Point 4 at \(x = 10\) has extreme leverage \(h_{44} = 0.97\), far above the average leverage \(p/n = 0.5\).)

Step 3: Compute Cook's Distance.

For observation 4:

\[ D_4 = \frac{(0.43)^2}{2 \times 3.165} \cdot \frac{0.97}{(1 - 0.97)^2} = \frac{0.185}{6.33} \cdot \frac{0.97}{0.0009} = 0.0292 \times 1077.8 = 31.5 \]

For observation 3:

\[ D_3 = \frac{(-2.03)^2}{2 \times 3.165} \cdot \frac{0.27}{(1 - 0.27)^2} = \frac{4.121}{6.33} \cdot \frac{0.27}{0.533} = 0.651 \times 0.507 = 0.33 \]

Observation 4 dominates. Its residual is small precisely because it pulls the line toward itself, but its extreme leverage makes \(D_4 = 31.5\), far above the \(D_i > 1\) cutoff: removing it would change the slope from 3.22 to 1.5. Observation 3 has the largest residual, yet its modest leverage keeps \(D_3\) at only 0.33.
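The leave-one-out reading of \(D_4\) can be cross-checked by refitting from the raw data without observation 4 — only the three nearly collinear points remain, and the slope collapses:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.0, 4.0, 5.0, 30.0])
X = np.column_stack([np.ones(4), x])

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_wo4 = np.linalg.lstsq(X[:3], y[:3], rcond=None)[0]

print(beta_full[1])  # slope of the full fit: ~3.22
print(beta_wo4[1])   # slope without observation 4: ~1.5
```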

Common Mistakes

  • High leverage = influential: Not necessarily. A point can have high leverage but sit right on the regression line (small residual), giving a low Cook's Distance.
  • Large residual = influential: Not necessarily. An outlier in \(y\) near \(\bar{x}\) has low leverage, so it may not shift the line much.
  • Automatically removing influential points: Cook's Distance flags points for investigation, not deletion. The point might be the most informative observation in your dataset.
  • Using only one threshold: Different thresholds flag different numbers of points. Always look at the distribution of \(D_i\) values, not just a single cutoff.