Spearman's Rank Correlation¶

The Formula¶

If there are no tied ranks:

\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]

Where \(d_i = \text{rank}(x_i) - \text{rank}(y_i)\) is the difference between the two ranks of each observation.

If there are ties, give each tied value the average of the ranks it spans, then apply the standard Pearson correlation formula to the ranks.
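The no-ties formula can be sketched in a few lines of plain Python. The function names `ranks` and `spearman_shortcut` are illustrative choices for this sketch, not part of any library, and the code deliberately rejects tied inputs rather than handling them.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_shortcut(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)). Valid only without ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two paired observations")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("the shortcut formula requires no tied values")
    rank_x, rank_y = ranks(x), ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Rank differences are 0, -1, 1, -1, 1, so sum(d^2) = 4 and rho = 1 - 24/120.
print(spearman_shortcut([10, 20, 30, 40, 50], [1, 3, 2, 5, 4]))  # 0.8
```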

What It Means¶

Spearman's \(\rho\) (rho) measures the strength and direction of a monotonic relationship between two ranked variables.

Monotonic: As \(x\) increases, \(y\) tends to increase (or decrease), but not necessarily at a constant rate. A curve is fine, as long as it doesn't reverse direction.
Ranked: It doesn't care about the raw values (like "100 meters"), only the order ("1st place", "2nd place").

It answers: "If I sort the data by \(x\), is it also sorted by \(y\)?"

Why It Works — The Intuition¶

Spearman's correlation is literally just Pearson's correlation applied to ranks.

Imagine converting your raw data into ranks (1, 2, 3...). The smallest value becomes 1, the next 2, etc.

* If \(y\) always increases as \(x\) increases, the two sets of ranks are identical (\(1 \to 1, 2 \to 2\)). Every difference \(d_i\) is 0, so \(\sum d_i^2 = 0\) and the formula gives \(1 - 0 = 1\). Perfect correlation!
* If \(y\) always decreases as \(x\) increases, the ranks are reversed (\(1 \to n, 2 \to n-1\)). This maximizes \(\sum d_i^2\), making the result \(-1\).

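Using ranks limits the influence of extreme outliers (an outlier only counts as the largest or smallest rank, however far away it is) and ignores the non-linear shape of the relationship, focusing purely on the order.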
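As a rough illustration of this intuition, the sketch below applies a hand-written Pearson correlation to the ranks of data that are monotonic but strongly curved. The helper names `pearson` and `ranks` and the sample data are assumptions made for this example, and the ranking helper only works when there are no ties.

```python
import math


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    ss_a = sum((p - mean_a) ** 2 for p in a)
    ss_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(ss_a * ss_b)


def ranks(values):
    """Rank from 1 (smallest) to n. Assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


x = [0.5, 1.0, 2.0, 3.5, 5.0]
y_up = [math.exp(v) for v in x]    # always increasing, far from linear
y_down = [-(v ** 3) for v in x]    # always decreasing, far from linear

rho_up = pearson(ranks(x), ranks(y_up))
rho_down = pearson(ranks(x), ranks(y_down))
raw_up = pearson(x, y_up)

print(rho_up, rho_down)  # 1.0 -1.0
print(raw_up < 1)        # True: Pearson on the raw values is penalized by the curve
```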

Derivation¶

Start with the Pearson formula for variables \(R_x\) and \(R_y\) (the ranks):

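\[ r = \frac{\sum (R_{xi} - \bar{R})(R_{yi} - \bar{R})}{\sqrt{\sum (R_{xi} - \bar{R})^2 \sum (R_{yi} - \bar{R})^2}} \]

With no ties, both sets of ranks are exactly the integers \(1, 2, \dots, n\), so:

1. The Mean: The sum of integers \(1 \dots n\) is \(\frac{n(n+1)}{2}\), so the mean is \(\bar{R} = \frac{n+1}{2}\) for both variables.
2. The Sum of Squares: The sum of squared deviations of integers \(1 \dots n\) is fixed: \(\sum (R_i - \bar{R})^2 = \frac{n(n^2-1)}{12}\).
3. The Denominator: Both rank sets share this sum of squares, so the square root in the denominator reduces to \(\frac{n(n^2-1)}{12}\).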
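The two closed forms for the mean and the sum of squared deviations can be checked numerically. The sketch below uses exact fractions so that rounding cannot hide a mismatch. `rank_constants` is a name invented for this check.

```python
from fractions import Fraction


def rank_constants(n):
    """Return (mean, sum of squared deviations) of the ranks 1..n, computed exactly."""
    ranks = range(1, n + 1)
    mean = Fraction(sum(ranks), n)
    sum_sq = sum((r - mean) ** 2 for r in ranks)
    return mean, sum_sq


# Compare the brute-force values with the closed forms for a range of sample sizes.
for n in range(2, 101):
    mean, sum_sq = rank_constants(n)
    assert mean == Fraction(n + 1, 2)
    assert sum_sq == Fraction(n * (n ** 2 - 1), 12)

print(rank_constants(3))  # (Fraction(2, 1), Fraction(2, 1))
```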

Substituting these known constants into the Pearson formula and simplifying leads algebraically to the shortcut formula:

\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]

This derivation assumes no ties (no duplicate values), which ensures the ranks are exactly the set \(\{1, \dots, n\}\).
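For readers who want the skipped algebra, one way to fill it in is sketched here. The shorthand \(S\) and \(C\) is introduced only for this sketch: \(S = \frac{n(n^2-1)}{12}\) is the sum of squared deviations shared by both rank sets, and \(C = \sum (R_{xi} - \bar{R})(R_{yi} - \bar{R})\) is the Pearson numerator.

Write each rank difference in terms of deviations from the common mean and expand the square:

\[ d_i = (R_{xi} - \bar{R}) - (R_{yi} - \bar{R}) \]

\[ \sum d_i^2 = \sum (R_{xi} - \bar{R})^2 + \sum (R_{yi} - \bar{R})^2 - 2 \sum (R_{xi} - \bar{R})(R_{yi} - \bar{R}) = 2S - 2C \]

Solving for \(C\) gives \(C = S - \frac{1}{2} \sum d_i^2\). Without ties the Pearson denominator equals \(S\), so:

\[ r = \frac{C}{S} = 1 - \frac{\sum d_i^2}{2S} = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]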

Variables Explained¶

Symbol	Name	Description
\(\rho\) (rho)	Spearman's Correlation	The correlation coefficient for ranks
\(d_i\)	Rank Difference	\(\text{rank}(x_i) - \text{rank}(y_i)\)
\(n\)	Sample Size	Number of paired observations
\(\sum d_i^2\)	Sum of Squared Differences	Measure of disagreement between rankings

Worked Example¶

Study Hours vs. Test Grade (Non-linear relationship)

Student	Hours	Grade	Rank (Hours)	Rank (Grade)
A	1	50	1	1
B	10	80	2	2
C	100	95	3	3

Notice: A Pearson correlation on raw data (1, 10, 100) vs (50, 80, 95) falls short of 1 (it is about 0.81) because the points do not lie on a straight line: the jump from 10 to 100 hours adds only 15 points. But Spearman looks at ranks:

* Hours Ranks: 1, 2, 3
* Grade Ranks: 1, 2, 3

\(d = 0, 0, 0 \to \sum d^2 = 0\).

\[ \rho = 1 - \frac{6(0)}{3(3^2 - 1)} = 1 \]

Perfect Spearman correlation, even though the relationship isn't a straight line.
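The example can be reproduced with a short script. This is a sketch using hand-written helpers (`pearson` and `ranks` are names chosen here, and `ranks` assumes no ties) rather than a statistics library.

```python
import math

hours = [1, 10, 100]
grades = [50, 80, 95]


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    ss_a = sum((p - mean_a) ** 2 for p in a)
    ss_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(ss_a * ss_b)


def ranks(values):
    """Rank from 1 (smallest) to n. Assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


pearson_raw = pearson(hours, grades)
spearman = pearson(ranks(hours), ranks(grades))

print(round(pearson_raw, 3))  # 0.807
print(spearman)               # 1.0
```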

Common Mistakes¶

Using the shortcut formula with ties: If many values are the same (e.g., two people tied for 2nd place), the shortcut formula is inaccurate. Use the standard Pearson formula on the ranks instead.
Confusing Monotonic with Linear: Spearman = 1 just means "always increasing," not "increasing in a straight line."
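To see how far the shortcut can drift when ties are present, the sketch below ranks tied values by the average of the positions they occupy (a common convention, assumed here) and compares the shortcut formula with Pearson on those ranks. The helper names and the data are made up for illustration.

```python
import math


def average_ranks(values):
    """Rank from 1. Tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    ss_a = sum((p - mean_a) ** 2 for p in a)
    ss_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(ss_a * ss_b)


x = [1, 2, 3, 4, 5, 6]
y = [1, 1, 1, 2, 2, 2]  # heavily tied

rank_x, rank_y = average_ranks(x), average_ranks(y)
n = len(x)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))

shortcut = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
exact = pearson(rank_x, rank_y)

print(round(shortcut, 3))  # 0.886
print(round(exact, 3))     # 0.878
```

The gap is small here but grows with the share of tied values, which is why Pearson on the ranks is the safer default.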

Pearson's Correlation — The parametric equivalent for linear relationships.
Kendall's Tau — Another rank-based correlation, often better for small samples or many ties.