Independent and Identically Distributed (i.i.d.)

The History Behind the Assumption

The acronym i.i.d. (independent and identically distributed) is a cornerstone of modern statistics, but its roots lie in the study of games of chance in the 17th century.

Mathematicians like Gerolamo Cardano (1501-1576) and Blaise Pascal (1623-1662) intuited that to calculate probabilities in dice or card games, each "hand" or throw had to be treated as an isolated event, uninfluenced by previous ones, and governed by the same physical rules. However, formal rigor came much later.

Jacob Bernoulli (1654-1705), in his posthumous work Ars Conjectandi (1713), was the first to explicitly use these assumptions to prove the Law of Large Numbers. He reasoned about repeated processes (like drawing from an urn with replacement) where each trial was an exact and separate copy of the others.

Only in the 20th century, with the axiomatization of probability by Andrey Kolmogorov (1933), did the concept of stochastic independence receive the precise mathematical definition based on measure theory that we use today (\(P(A \cap B) = P(A)P(B)\)).

Today, the i.i.d. assumption is the default starting model of almost every Machine Learning algorithm and of statistical inference: we assume that the data we observe are drawn at random from the same "bag" (distribution) and that drawing one does not change what we will find next.

Why It Matters

The i.i.d. assumption is what makes statistics tractable. Without it:

  1. Complexity: If every data point influenced the others (dependence), we would have to model complex interactions (\(N\) data points would require estimating on the order of \(N^2\) pairwise relationships).
  2. Generalization: If every data point followed a different rule (non-identical distribution), we could not learn anything from the past to predict the future.

It is fundamental for:

  • Hypothesis Testing: t-tests and ANOVA assume that the observations within each group are i.i.d. samples.
  • Machine Learning: Training and test sets must come from the same distribution (identical) and be independent, otherwise the model does not generalize (overfitting or data leakage).
  • Central Limit Theorem: Works (in its basic form) only for i.i.d. variables.

The Concept

A sequence of random variables \(X_1, X_2, \ldots, X_n\) is said to be i.i.d. if it satisfies two separate but simultaneous conditions:

  1. Independence: The value assumed by one variable provides no information about the value of the others.
  2. Identical Distribution: Each variable comes from the same probability distribution, with the same parameters (e.g., same mean \(\mu\) and variance \(\sigma^2\)).

1. Independence (Formally)

Two variables \(X\) and \(Y\) are independent if the joint probability is the product of the marginal probabilities.

For discrete random variables:

\[ P(X=x \text{ and } Y=y) = P(X=x) \cdot P(Y=y) \]

In terms of probability density functions (PDF) for continuous variables:

\[ f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \]

Intuitively: Knowing that \(X\) is high does not change my bet on \(Y\). Independence implies zero covariance, \(\text{Cov}(X, Y) = 0\), but the converse does not hold: zero covariance does not imply independence.
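The warning above (zero covariance without independence) can be made concrete with a small simulation; this is a minimal sketch where the choice of \(X \in \{-1, 0, 1\}\) and \(Y = X^2\) is just an illustrative example:

```python
import numpy as np

rng = np.random.default_rng(0)

# X takes -1, 0, 1 with equal probability; Y = X**2 is fully determined by X.
# Still, Cov(X, Y) = E[X^3] - E[X]E[X^2] = 0.
x = rng.choice([-1, 0, 1], size=100_000)
y = x**2

cov = np.cov(x, y)[0, 1]
print(f"sample covariance ≈ {cov:.4f}")

# The product rule for independence fails:
p_joint = np.mean((x == 1) & (y == 1))      # ≈ 1/3
p_prod = np.mean(x == 1) * np.mean(y == 1)  # ≈ (1/3)(2/3) = 2/9
print(f"P(X=1, Y=1) ≈ {p_joint:.3f},  P(X=1)·P(Y=1) ≈ {p_prod:.3f}")
```

The joint probability \(1/3\) differs from the product \(2/9\), so \(X\) and \(Y\) are dependent despite their covariance being (essentially) zero.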

2. Identical Distribution (Formally)

Variables \(X_1\) and \(X_2\) are identically distributed if they have the same Cumulative Distribution Function (CDF):

\[ P(X_1 \le x) = P(X_2 \le x) \quad \text{for all } x \]

Intuitively: The mechanism generating the data does not change over time. We are not changing the die, we are not changing the coin, and the physical process remains stable.
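As a quick sanity check of this definition, one can draw two samples from the same die and compare their empirical CDFs point by point; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent samples from the same fair die: identically distributed,
# so the empirical CDFs should agree (up to sampling noise) at every x
a = rng.integers(1, 7, size=50_000)
b = rng.integers(1, 7, size=50_000)

for x in range(1, 7):
    print(f"P(X1 <= {x}) ≈ {np.mean(a <= x):.3f}   "
          f"P(X2 <= {x}) ≈ {np.mean(b <= x):.3f}")
```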

Examples and Counterexamples: Understanding the Nuances

To truly understand i.i.d., let's analyze the 4 possible cases by crossing the two properties.

Case 1: i.i.d. (The Ideal)

Scenario: Rolling the same fair die 10 times.

  • Independent? Yes. The result of the first roll does not physically influence the second; the die has no "memory".
  • Identical? Yes. The die is always the same (6 faces, probability 1/6 each).
  • Result: By the Law of Large Numbers, the average of the rolls tends to 3.5.
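The Law of Large Numbers claim above can be checked with a quick simulation (a minimal sketch using NumPy):

```python
import numpy as np

rng = np.random.default_rng(42)

# The running mean of fair-die rolls approaches E[X] = 3.5 as n grows
for n in (10, 1_000, 100_000):
    rolls = rng.integers(1, 7, size=n)
    print(f"n = {n:>7}: mean = {rolls.mean():.3f}")
```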

Case 2: Independent but NOT Identical

Scenario: Rolling a 6-sided die first, then a 20-sided die.

  • Independent? Yes. The roll of the first die does not affect the second.
  • Identical? No. The first takes values in \([1,6]\), the second in \([1,20]\); they have different means and variances.
  • Consequence: It makes no sense to compute a single "sample mean" to estimate one parameter, because the data come from different "populations".
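A short simulation illustrates why pooling the two dice is misleading (a sketch; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

d6 = rng.integers(1, 7, size=50_000)    # E[X] = 3.5
d20 = rng.integers(1, 21, size=50_000)  # E[Y] = 10.5

# The pooled mean lands near 7.0, a value that describes neither die:
# the data are not identically distributed
pooled = np.concatenate([d6, d20])
print(d6.mean(), d20.mean(), pooled.mean())
```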

Case 3: Identical but NOT Independent

Scenario: Drawing cards from a deck without replacement (or weather forecasts).

  • Identical? Yes (marginally). If I shuffle the deck and take the first card, the probability it's an Ace is 4/52. Knowing nothing about the first card, the prior probability that the second is an Ace is also 4/52. Marginally, each draw has the same distribution.
  • Independent? No. If the first card is the Ace of Hearts, the second cannot be; the conditional probability changes drastically. Similarly, if it rains today, rain tomorrow is more likely than on a random day.
  • Consequence: The variance of the sum is not the sum of the variances; "duplicated" information counts less than two genuinely new observations.
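The marginal-vs-conditional distinction in the card example can be verified numerically; a minimal sketch encoding the deck as aces (1) and non-aces (0):

```python
import numpy as np

rng = np.random.default_rng(7)

# Deck: 4 aces (1) and 48 other cards (0); draw two without replacement
deck = np.array([1] * 4 + [0] * 48)

n_trials = 100_000
first = np.empty(n_trials, dtype=int)
second = np.empty(n_trials, dtype=int)
for i in range(n_trials):
    shuffled = rng.permutation(deck)
    first[i], second[i] = shuffled[0], shuffled[1]

# Marginally identical: each draw is an ace with probability 4/52 ≈ 0.077
p_first = np.mean(first == 1)
p_second = np.mean(second == 1)

# Not independent: given an ace first, the probability drops to 3/51 ≈ 0.059
p_cond = np.mean(second[first == 1] == 1)
print(p_first, p_second, p_cond)
```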

Case 4: Neither Independent nor Identical

Scenario: The price of a stock during a crisis.

  • Identical? No. Volatility (variance) changes over time: the market is calm today and turbulent tomorrow. The distribution changes regime.
  • Independent? No. Today's price depends strongly on yesterday's (autocorrelation).
  • Consequence: These data require dedicated models (stochastic processes, GARCH, time-series methods); basic statistical formulas fail.

Mathematical Implications (Simplifications)

The i.i.d. assumption drastically simplifies calculations.

1. Expected Value of the Product

If \(X, Y\) are independent:

\[ E[XY] = E[X]E[Y] \]

(Without independence, we would need to know the correlation).

2. Variance of the Sum

If \(X, Y\) are independent:

\[ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \]

If they were not:

\[ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y) \]

This explains why the variance of the sample mean \(\bar{X}\) drops as \(1/n\): the covariance terms are all zero.
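Both identities can be checked empirically; a minimal sketch where the dependent case is constructed as \(Y = X + \varepsilon\) (an illustrative choice giving \(\text{Cov}(X, Y) = 1\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Independent case: Var(X + Y) = Var(X) + Var(Y) = 1 + 1 = 2
x = rng.normal(0.0, 1.0, n)
y = rng.normal(0.0, 1.0, n)
var_indep = np.var(x + y)

# Dependent case: Y = X + noise, so Cov(X, Y) = 1 and
# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) = 1 + 2 + 2 = 5
y_dep = x + rng.normal(0.0, 1.0, n)
var_dep = np.var(x + y_dep)
print(var_indep, var_dep)
```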

3. Likelihood Function

To estimate parameters (e.g., Maximum Likelihood Estimation), we need to calculate the probability of observing the entire sample \(x_1, \ldots, x_n\). Thanks to independence, this is simply the product of individual probabilities:

\[ L(\theta) = P(x_1, \ldots, x_n | \theta) = \prod_{i=1}^n P(x_i | \theta) \]

Transforming into logarithms (\(\log L\)), the product becomes a sum, which is very easy to maximize (differentiate).
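As a concrete instance, the log-likelihood of an i.i.d. normal sample can be maximized numerically; this sketch assumes \(\sigma = 1\) is known and uses a simple grid search (the true \(\mu = 2\) is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)

# i.i.d. sample from Normal(mu=2, sigma=1); sigma assumed known
data = rng.normal(loc=2.0, scale=1.0, size=10_000)

def log_likelihood(mu, x, sigma=1.0):
    # Independence turns the joint density into a product, so the
    # log-likelihood is a simple sum over the observations
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Grid search over candidate means; the maximizer agrees with the
# sample mean, the closed-form MLE for a normal location parameter
grid = np.linspace(0.0, 4.0, 4001)
mu_hat = grid[np.argmax([log_likelihood(m, data) for m in grid])]
print(mu_hat, data.mean())
```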

Common Errors

  1. Gambler's Fallacy: Believing that if an i.i.d. event (e.g., "Red" at roulette) hasn't happened for a while, it is "due" or more likely on the next spin. Reality: If the process is i.i.d., the wheel has no memory; the probability of Red is identical on every spin (18/37 on a European wheel, just under 50%).

  2. Ignoring Autocorrelation: Treating temporal data (e.g., daily sales) as i.i.d. when in reality today's sales depend on yesterday's. Consequence: Uncertainty is underestimated (Standard Error too small) and patterns are seen where there are none.
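The standard-error underestimation can be demonstrated with a toy autocorrelated process; a minimal sketch using a hypothetical AR(1) series (\(x_t = \phi x_{t-1} + \varepsilon_t\)), where \(\phi = 0\) recovers the i.i.d. case:

```python
import numpy as np

rng = np.random.default_rng(11)

def se_of_mean_ar1(phi, n=200, reps=2_000):
    # Monte Carlo standard error of the sample mean of an AR(1) series
    # with standard normal innovations; phi = 0 gives i.i.d. data
    means = np.empty(reps)
    for r in range(reps):
        noise = rng.normal(size=n)
        x = np.empty(n)
        x[0] = noise[0]
        for t in range(1, n):
            x[t] = phi * x[t - 1] + noise[t]
        means[r] = x.mean()
    return means.std()

se_iid = se_of_mean_ar1(0.0)  # close to the textbook sigma / sqrt(n) ≈ 0.071
se_ar = se_of_mean_ar1(0.8)   # several times larger
print(se_iid, se_ar)
```

The naive \(\sigma/\sqrt{n}\) formula describes only the \(\phi = 0\) case; applying it to the autocorrelated series would understate the true uncertainty by a large factor.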

  3. Selection Bias: Collecting data only from a subgroup (e.g., survey only on those with a landline) violates the assumption that data are identically distributed with respect to the target population.

Variables and Symbols

| Symbol | Name | Description |
|---|---|---|
| \(X_i \sim D\) | Random variable | The \(i\)-th observation, distributed according to distribution \(D\) |
| \(\perp\) or \(\perp\!\!\!\perp\) | Independence symbol | \(X \perp Y\) means \(X\) and \(Y\) are independent |
| \(F_X(x)\) | CDF (cumulative distribution function) | The probability \(P(X \le x)\); defines the distribution |
| \(f_X(x)\) | PDF/PMF (density/mass) | Density or point probability at \(x\) |
| \(\text{Cov}(X,Y)\) | Covariance | Measure of linear dependence; zero under independence |
| \(\mathcal{L}(\theta)\) | Likelihood | Likelihood function; a product of marginal densities under i.i.d. |

References

  • Bernoulli, J. (1713). Ars Conjectandi. Basel. (First formal application on repeated trials).
  • Kolmogorov, A. N. (1933). Foundations of the Theory of Probability. (Axiomatic definition of independence).
  • Casella, G., & Berger, R. L. (2002). Statistical Inference. Duxbury Press. (Standard text for formal definitions).