
Pearson's Correlation Coefficient

The Formula

\[ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]

Or, in terms of sample statistics:

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\text{Cov}(x,y)}{s_x s_y} \]
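The formula translates directly into code. The sketch below (with made-up example data) computes \(r\) from the definition and cross-checks it against NumPy's built-in `np.corrcoef`:

```python
import numpy as np

# Example data, assumed purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Numerator: sum of cross-deviations, S_xy
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
# Denominator pieces: sums of squared deviations, S_xx and S_yy
s_xx = np.sum((x - x.mean()) ** 2)
s_yy = np.sum((y - y.mean()) ** 2)

r = s_xy / np.sqrt(s_xx * s_yy)

# Cross-check against NumPy's built-in implementation
print(r, np.corrcoef(x, y)[0, 1])
```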

What It Means

Pearson's \(r\) is a single number between \(-1\) and \(+1\) that tells you how much two variables change together in a linear way.

  • \(r = 1\): Perfect positive linear relationship. If \(x\) goes up, \(y\) goes up by a fixed proportion. The points lie exactly on a line with a positive slope.
  • \(r = -1\): Perfect negative linear relationship. If \(x\) goes up, \(y\) goes down. The points lie exactly on a line with a negative slope.
  • \(r = 0\): No linear relationship. Knowing \(x\) tells you nothing about \(y\) (linearly).

It measures the "tightness" of the cluster of points around a straight line. It does not measure the slope of that line (steepness), only how well the line fits.
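A quick way to see that \(r\) ignores steepness: rescaling \(y\) makes the fitted line ten times steeper but leaves \(r\) unchanged. A minimal sketch, with assumed example data:

```python
import numpy as np

# Assumed example data: roughly linear with some noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

r1 = np.corrcoef(x, y)[0, 1]
r2 = np.corrcoef(x, 10 * y)[0, 1]   # ten times steeper line, same "tightness"

slope1 = np.polyfit(x, y, 1)[0]
slope2 = np.polyfit(x, 10 * y, 1)[0]

print(r1, r2)          # identical: r is scale-invariant
print(slope1, slope2)  # slope2 is 10x slope1
```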

Why It Works — The Story Behind the Formula

The Quest for Co-Relation

In the late 19th century, Francis Galton was obsessed with heredity. He noticed that tall parents tended to have tall children, but not as tall — they "regressed" toward the mean. He wanted a number to quantify the strength of this hereditary link.

He started by plotting the data. He drew lines through the means. He realized that if he standardized the variables (measured them in units of deviation), the slope of the regression line itself became a measure of the relationship's strength.

Karl Pearson, Galton's protégé, took this geometric intuition and forged it into the precise algebraic formula we use today. He realized that "correlation" is essentially the average product of standardized deviations.

Intuition: Multiplying Deviations

Look at the numerator: \(\sum (x_i - \bar{x})(y_i - \bar{y})\).

  • If a point is above the mean in both \(x\) and \(y\), both terms are positive. Product is positive (+).
  • If a point is below the mean in both \(x\) and \(y\), both terms are negative. Product is positive (+).
  • If a point is high in \(x\) but low in \(y\) (or vice versa), one term is positive and one is negative. Product is negative (-).

Summing these up gives a "net score" of agreement:

  • Mostly + products? Positive correlation.
  • Mostly - products? Negative correlation.
  • A mix of both? They cancel out to near zero.

The denominator just scales this sum so the result always lands between -1 and 1, effectively removing the units of measurement (meters, kilograms, dollars) from the equation.
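This "net score" is easy to inspect directly. Using the ice-cream data from the worked example later in this article:

```python
import numpy as np

# Temperatures and sales from the worked example below.
x = np.array([20.0, 25.0, 30.0])
y = np.array([200.0, 300.0, 350.0])

# Per-point products of deviations: their signs show agreement.
products = (x - x.mean()) * (y - y.mean())
print(products)        # all non-negative -> positive correlation
print(products.sum())  # the unscaled numerator of r (S_xy = 750)
```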

Derivation

We can derive \(r\) by asking: What is the cosine of the angle between two centered vectors?

Geometric Derivation (Vector Interpretation)

Imagine our data as two vectors in \(n\)-dimensional space. Let centered vectors be:

\[ \mathbf{u} = \begin{bmatrix} x_1 - \bar{x} \\ \vdots \\ x_n - \bar{x} \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} y_1 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{bmatrix} \]

The dot product of these vectors is:

\[ \mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \]

The length (Euclidean norm) of each vector is:

\[ \|\mathbf{u}\| = \sqrt{\sum (x_i - \bar{x})^2}, \quad \|\mathbf{v}\| = \sqrt{\sum (y_i - \bar{y})^2} \]

From linear algebra, the dot product is related to the angle \(\theta\) between the vectors:

\[ \mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \, \|\mathbf{v}\| \cos \theta \]

Solving for \(\cos \theta\):

\[ \cos \theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}} \]

This is exactly the formula for \(r\)! Therefore, Pearson's \(r\) is the cosine of the angle between the centered variable vectors.

  • If vectors point in the same direction (\(\theta = 0^\circ\)), \(\cos \theta = 1\).
  • If they point in opposite directions (\(\theta = 180^\circ\)), \(\cos \theta = -1\).
  • If they are orthogonal (uncorrelated, \(\theta = 90^\circ\)), \(\cos \theta = 0\).
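The vector interpretation can be verified numerically: center both variables, take the cosine of the angle between them, and compare with `np.corrcoef` (example data assumed):

```python
import numpy as np

# Assumed example data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

u = x - x.mean()   # centered x vector
v = y - y.mean()   # centered y vector

# cos(theta) = (u . v) / (||u|| ||v||)
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
r = np.corrcoef(x, y)[0, 1]
print(cos_theta, r)  # equal: r is the cosine of the angle
```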

Connection to Least Squares Slope

Recall the slope of the simple linear regression line:

\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \]

And Pearson's \(r\):

\[ r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} \]

We can rewrite \(\hat{\beta}_1\) in terms of \(r\):

\[ \hat{\beta}_1 = r \frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}} = r \frac{s_y}{s_x} \]

This confirms Galton's insight: if you standardize the data (so \(s_x = s_y = 1\)), the regression slope is the correlation coefficient.
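The identity \(\hat{\beta}_1 = r \, s_y / s_x\) can be checked against a least-squares fit. A sketch with assumed example data:

```python
import numpy as np

# Assumed example data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.1, 2.9, 4.2, 4.8])

r = np.corrcoef(x, y)[0, 1]
slope = np.polyfit(x, y, 1)[0]   # least-squares slope

# Sample standard deviations (ddof=1); the ratio s_y/s_x is the
# same whichever ddof convention is used, since the factors cancel.
print(slope, r * np.std(y, ddof=1) / np.std(x, ddof=1))  # equal
```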

Variables Explained

| Symbol | Name | Description |
|---|---|---|
| \(r\) | Pearson's correlation | The measure of linear association (\(-1 \le r \le 1\)) |
| \(x_i\), \(y_i\) | Data points | Individual paired observations |
| \(\bar{x}\), \(\bar{y}\) | Means | Average values of \(x\) and \(y\) |
| \(S_{xy}\) | Sum of cross-deviations | Numerator; measures covariance direction |
| \(S_{xx}\), \(S_{yy}\) | Sums of squared deviations | Denominator terms; related to the variances of \(x\) and \(y\) |
| \(\text{Cov}(x,y)\) | Covariance | Unscaled measure of association |
| \(s_x\), \(s_y\) | Sample standard deviations | Measures of spread for each variable |

Worked Example

Ice Cream Sales vs. Temperature

| Day | Temp (\(^\circ\)C) (\(x\)) | Sales ($) (\(y\)) |
|---|---|---|
| 1 | 20 | 200 |
| 2 | 25 | 300 |
| 3 | 30 | 350 |

1. Calculate Means:

\[ \bar{x} = \frac{20+25+30}{3} = 25 \]
\[ \bar{y} = \frac{200+300+350}{3} = 283.33 \]

2. Calculate Deviations and Products:

| \(x_i - \bar{x}\) | \(y_i - \bar{y}\) | Product | \((x-\bar{x})^2\) | \((y-\bar{y})^2\) |
|---|---|---|---|---|
| -5 | -83.33 | 416.65 | 25 | 6943.89 |
| 0 | 16.67 | 0 | 0 | 277.89 |
| 5 | 66.67 | 333.35 | 25 | 4444.89 |
| **Sum** | | 750 (\(S_{xy}\)) | 50 (\(S_{xx}\)) | 11666.67 (\(S_{yy}\)) |

3. Plug into Formula:

\[ r = \frac{750}{\sqrt{50} \cdot \sqrt{11666.67}} = \frac{750}{7.07 \cdot 108.01} = \frac{750}{763.76} \approx 0.98 \]

\(r \approx 0.98\) indicates a very strong positive correlation. Hotter days strongly predict higher sales.
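The whole worked example can be reproduced in a few lines:

```python
import numpy as np

# The ice cream data from the worked example above.
temp  = np.array([20.0, 25.0, 30.0])
sales = np.array([200.0, 300.0, 350.0])

s_xy = np.sum((temp - temp.mean()) * (sales - sales.mean()))  # 750
s_xx = np.sum((temp - temp.mean()) ** 2)                      # 50
s_yy = np.sum((sales - sales.mean()) ** 2)                    # ~11666.67

r = s_xy / np.sqrt(s_xx * s_yy)
print(round(r, 2))  # 0.98
```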

Common Mistakes

  • Correlation \(\ne\) Causation: A high \(r\) does not mean \(x\) causes \(y\). Both could be caused by \(z\) (confounding), or it could be a complete coincidence (spurious correlation).
  • Linearity Assumption: \(r\) only detects linear relationships. If \(y = x^2\) (a parabola), \(r\) might be 0 even though the relationship is perfect and deterministic. Always plot your data!
  • Outliers: A single outlier can drastically swing \(r\) toward 0 or 1. It is not a robust statistic.
  • Restriction of Range: Measuring height vs. basketball skill only among NBA players will show a low correlation, because you've eliminated the variation.
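The linearity pitfall is easy to demonstrate concretely: for \(y = x^2\) over a range symmetric about zero, the positive and negative products of deviations cancel exactly, so \(r = 0\) despite a perfect deterministic relationship.

```python
import numpy as np

# A perfect but nonlinear relationship over a symmetric range.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r)  # 0.0: no linear association despite perfect dependence
```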

History

  • 1888: Francis Galton introduces the concept of "co-relation" in Co-relations and their Measurement. He uses graphical methods to estimate it.
  • 1895: Karl Pearson develops the mathematical formula, the "product-moment correlation coefficient," in Note on Regression and Inheritance in the Case of Two Parents.
  • 1915–1920: Ronald Fisher derives the exact sampling distribution of \(r\), allowing for hypothesis testing and confidence intervals and transforming it from a descriptive statistic into an inferential tool.

References

  • Pearson, K. (1895). "Note on regression and inheritance in the case of two parents." Proceedings of the Royal Society of London.
  • Galton, F. (1888). "Co-relations and their measurement, chiefly from anthropometric data."
  • Rodgers, J. L., & Nicewander, W. A. (1988). "Thirteen ways to look at the correlation coefficient." The American Statistician.