
Pearson's Correlation Coefficient

The Formula

\[ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]

Or, in terms of sample statistics:

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\text{Cov}(x,y)}{s_x s_y} \]
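The formula translates directly into code. The sketch below (with made-up example data) computes \(r\) from the definition and cross-checks it against NumPy's built-in `np.corrcoef`:

```python
import numpy as np

# Example data, assumed purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Numerator: sum of cross-deviations, S_xy
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
# Denominator pieces: sums of squared deviations, S_xx and S_yy
s_xx = np.sum((x - x.mean()) ** 2)
s_yy = np.sum((y - y.mean()) ** 2)

r = s_xy / np.sqrt(s_xx * s_yy)

# Cross-check against NumPy's built-in implementation
print(r, np.corrcoef(x, y)[0, 1])
```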

What It Means

Pearson's \(r\) is a single number between \(-1\) and \(+1\) that tells you how much two variables change together in a linear way.

  • \(r = 1\): Perfect positive linear relationship. If \(x\) goes up, \(y\) goes up by a fixed proportion. The points lie exactly on a line with a positive slope.
  • \(r = -1\): Perfect negative linear relationship. If \(x\) goes up, \(y\) goes down. The points lie exactly on a line with a negative slope.
  • \(r = 0\): No linear relationship. Knowing \(x\) tells you nothing about \(y\) (linearly).

It measures the "tightness" of the cluster of points around a straight line. It does not measure the slope of that line (steepness), only how well the line fits.
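A quick way to see that \(r\) ignores steepness: rescaling \(y\) makes the fitted line ten times steeper but leaves \(r\) unchanged. A minimal sketch, with assumed example data:

```python
import numpy as np

# Assumed example data: roughly linear with some noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

r1 = np.corrcoef(x, y)[0, 1]
r2 = np.corrcoef(x, 10 * y)[0, 1]   # ten times steeper line, same "tightness"

slope1 = np.polyfit(x, y, 1)[0]
slope2 = np.polyfit(x, 10 * y, 1)[0]

print(r1, r2)          # identical: r is scale-invariant
print(slope1, slope2)  # slope2 is 10x slope1
```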

Why It Works — The Story Behind the Formula

The Quest for Co-Relation

In the late 19th century, Francis Galton was obsessed with heredity. He noticed that tall parents tended to have tall children, but not as tall — they "regressed" toward the mean. He wanted a number to quantify the strength of this hereditary link.

He started by plotting the data. He drew lines through the means. He realized that if he standardized the variables (measured them in units of deviation), the slope of the regression line itself became a measure of the relationship's strength.

Karl Pearson, Galton's protégé, took this geometric intuition and forged it into the precise algebraic formula we use today. He realized that "correlation" is essentially the average product of standardized deviations.

Intuition: Multiplying Deviations

Look at the numerator: \(\sum (x_i - \bar{x})(y_i - \bar{y})\).

  • If a point is above the mean in both \(x\) and \(y\), both terms are positive. Product is positive (+).
  • If a point is below the mean in both \(x\) and \(y\), both terms are negative. Product is positive (+).
  • If a point is high in \(x\) but low in \(y\) (or vice versa), one term is positive and one is negative. Product is negative (-).

Summing these up gives a "net score" of agreement:

  • Mostly + products? Positive correlation.
  • Mostly - products? Negative correlation.
  • A mix of both? They cancel out to near zero.

The denominator just scales this sum so the result always lands between -1 and 1, effectively removing the units of measurement (meters, kilograms, dollars) from the equation.
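This "net score" is easy to inspect directly. Using the ice-cream data from the worked example later in this article:

```python
import numpy as np

# Temperatures and sales from the worked example below.
x = np.array([20.0, 25.0, 30.0])
y = np.array([200.0, 300.0, 350.0])

# Per-point products of deviations: their signs show agreement.
products = (x - x.mean()) * (y - y.mean())
print(products)        # all non-negative -> positive correlation
print(products.sum())  # the unscaled numerator of r (S_xy = 750)
```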

Derivation

We can derive \(r\) by asking: What is the cosine of the angle between two centered vectors?

Geometric Derivation (Vector Interpretation)

Imagine our data as two vectors in \(n\)-dimensional space. Let centered vectors be:

\[ \mathbf{u} = \begin{bmatrix} x_1 - \bar{x} \\ \vdots \\ x_n - \bar{x} \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} y_1 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{bmatrix} \]

The dot product of these vectors is:

\[ \mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \]

The length (Euclidean norm) of each vector is:

\[ \|\mathbf{u}\| = \sqrt{\sum (x_i - \bar{x})^2}, \quad \|\mathbf{v}\| = \sqrt{\sum (y_i - \bar{y})^2} \]

From linear algebra, the dot product is related to the angle \(\theta\) between the vectors:

\[ \mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \, \|\mathbf{v}\| \cos \theta \]

Solving for \(\cos \theta\):

\[ \cos \theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}} \]

This is exactly the formula for \(r\)! Therefore, Pearson's \(r\) is the cosine of the angle between the centered variable vectors.

  • If vectors point in the same direction (\(\theta = 0^\circ\)), \(\cos \theta = 1\).
  • If they point in opposite directions (\(\theta = 180^\circ\)), \(\cos \theta = -1\).
  • If they are orthogonal (uncorrelated, \(\theta = 90^\circ\)), \(\cos \theta = 0\).
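The vector interpretation can be verified numerically: center both variables, take the cosine of the angle between them, and compare with `np.corrcoef` (example data assumed):

```python
import numpy as np

# Assumed example data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

u = x - x.mean()   # centered x vector
v = y - y.mean()   # centered y vector

# cos(theta) = (u . v) / (||u|| ||v||)
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
r = np.corrcoef(x, y)[0, 1]
print(cos_theta, r)  # equal: r is the cosine of the angle
```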

Connection to Least Squares Slope

Recall the slope of the simple linear regression line:

\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \]

And Pearson's \(r\):

\[ r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} \]

We can rewrite \(\hat{\beta}_1\) in terms of \(r\):

\[ \hat{\beta}_1 = r \frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}} = r \frac{s_y}{s_x} \]

This confirms Galton's insight: if you standardize the data (so \(s_x = s_y = 1\)), the regression slope is the correlation coefficient.
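The identity \(\hat{\beta}_1 = r \, s_y / s_x\) can be checked against a least-squares fit. A sketch with assumed example data:

```python
import numpy as np

# Assumed example data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.1, 2.9, 4.2, 4.8])

r = np.corrcoef(x, y)[0, 1]
slope = np.polyfit(x, y, 1)[0]   # least-squares slope

# Sample standard deviations (ddof=1); the ratio s_y/s_x is the
# same whichever ddof convention is used, since the factors cancel.
print(slope, r * np.std(y, ddof=1) / np.std(x, ddof=1))  # equal
```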

Variables Explained

| Symbol | Name | Description |
|---|---|---|
| \(r\) | Pearson's correlation | The measure of linear association (\(-1 \le r \le 1\)) |
| \(x_i\), \(y_i\) | Data points | Individual paired observations |
| \(\bar{x}\), \(\bar{y}\) | Means | Average values of \(x\) and \(y\) |
| \(S_{xy}\) | Sum of cross-deviations | Numerator; measures covariance direction |
| \(S_{xx}\), \(S_{yy}\) | Sums of squared deviations | Denominator terms; related to the variances of \(x\) and \(y\) |
| \(\text{Cov}(x,y)\) | Covariance | Unscaled measure of association |
| \(s_x\), \(s_y\) | Sample standard deviations | Measures of spread for each variable |

Worked Example

Ice Cream Sales vs. Temperature

| Day | Temp (\(^\circ\)C) (\(x\)) | Sales ($) (\(y\)) |
|---|---|---|
| 1 | 20 | 200 |
| 2 | 25 | 300 |
| 3 | 30 | 350 |

1. Calculate Means:

\[ \bar{x} = \frac{20+25+30}{3} = 25 \]
\[ \bar{y} = \frac{200+300+350}{3} = 283.33 \]

2. Calculate Deviations and Products:

| \(x_i - \bar{x}\) | \(y_i - \bar{y}\) | Product | \((x-\bar{x})^2\) | \((y-\bar{y})^2\) |
|---|---|---|---|---|
| -5 | -83.33 | 416.65 | 25 | 6943.89 |
| 0 | 16.67 | 0 | 0 | 277.89 |
| 5 | 66.67 | 333.35 | 25 | 4444.89 |
| **Sum** | | 750 (\(S_{xy}\)) | 50 (\(S_{xx}\)) | 11666.67 (\(S_{yy}\)) |

3. Plug into Formula:

\[ r = \frac{750}{\sqrt{50} \cdot \sqrt{11666.67}} = \frac{750}{7.07 \cdot 108.01} = \frac{750}{763.76} \approx 0.98 \]

\(r \approx 0.98\) indicates a very strong positive correlation. Hotter days strongly predict higher sales.
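The whole worked example can be reproduced in a few lines:

```python
import numpy as np

# The ice cream data from the worked example above.
temp  = np.array([20.0, 25.0, 30.0])
sales = np.array([200.0, 300.0, 350.0])

s_xy = np.sum((temp - temp.mean()) * (sales - sales.mean()))  # 750
s_xx = np.sum((temp - temp.mean()) ** 2)                      # 50
s_yy = np.sum((sales - sales.mean()) ** 2)                    # ~11666.67

r = s_xy / np.sqrt(s_xx * s_yy)
print(round(r, 2))  # 0.98
```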

Common Mistakes

  • Correlation \(\ne\) Causation: A high \(r\) does not mean \(x\) causes \(y\). Both could be caused by \(z\) (confounding), or it could be a complete coincidence (spurious correlation).
  • Linearity Assumption: \(r\) only detects linear relationships. If \(y = x^2\) (a parabola), \(r\) might be 0 even though the relationship is perfect and deterministic. Always plot your data!
  • Outliers: A single outlier can drastically swing \(r\) toward 0 or 1. It is not a robust statistic.
  • Restriction of Range: Measuring height vs. basketball skill only among NBA players will show a low correlation, because you've eliminated the variation.
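The linearity pitfall is easy to demonstrate concretely: for \(y = x^2\) over a range symmetric about zero, the positive and negative products of deviations cancel exactly, so \(r = 0\) despite a perfect deterministic relationship.

```python
import numpy as np

# A perfect but nonlinear relationship over a symmetric range.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r)  # 0.0: no linear association despite perfect dependence
```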

History

  • 1888: Francis Galton introduces the concept of "co-relation" in Co-relations and their Measurement. He uses graphical methods to estimate it.
  • 1895: Karl Pearson develops the mathematical formula, the "product-moment correlation coefficient," in Note on Regression and Inheritance in the Case of Two Parents.
  • 1915–1920: Ronald Fisher derives the exact sampling distribution of \(r\), allowing for hypothesis testing and confidence intervals and transforming it from a descriptive statistic into an inferential tool.

References

  • Pearson, K. (1895). "Note on regression and inheritance in the case of two parents." Proceedings of the Royal Society of London.
  • Galton, F. (1888). "Co-relations and their measurement, chiefly from anthropometric data."
  • Rodgers, J. L., & Nicewander, W. A. (1988). "Thirteen ways to look at the correlation coefficient." The American Statistician.